Random array DNA analysis by hybridization

ABSTRACT

The invention relates to methods and devices for analyzing single molecules, i.e., nucleic acids. Such single molecules may be derived from natural samples, such as cells, tissues, soil, air and water without separating or enriching individual components. In certain aspects of the invention, the methods and devices are useful in performing nucleic acid sequence analysis by probe hybridization.

1. CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/335,827 (pending), filed Jul. 18, 2014; which is a continuation of U.S. patent application Ser. No. 13/633,034 (now U.S. Pat. No. 8,785,127), filed Oct. 1, 2012; which is a continuation of U.S. patent application Ser. No. 11/981,797 (now U.S. Pat. No. 8,278,039), filed Oct. 31, 2007; which is a continuation of U.S. patent application Ser. No. 10/547,214 (now U.S. Pat. No. 8,105,771), filed Jun. 29, 2006; which is the U.S. National Stage of PCT/US04/06022, filed Feb. 26, 2004, and published as WO2004/076683 on Sep. 10, 2004; which claims the priority benefit of U.S. Provisional Application 60/450,566, filed Feb. 26, 2003.

Related subject matter is disclosed in U.S. patent application Ser. No. 10/738,108 filed Dec. 16, 2003, entitled “Single Target Molecule Analysis by Compiling Multiple Transient Interactions with Probe Molecules,” which claims the priority benefit of U.S. Provisional Application Ser. No. 60/435,539 filed Dec. 20, 2002 entitled “Single Target Molecule Analysis by Compiling Multiple Transient Interactions with Probe Molecules.”

These and all other patents and patent applications cited herein are herein incorporated by reference in their entirety for all purposes.

2. BACKGROUND OF THE INVENTION

2.1 Technical Field

The invention relates to methods for analyzing molecules and devices for performing such analysis. The methods and devices allow reliable analysis of a single molecule of nucleic acids. Such single molecules may be derived from natural samples such as cells, tissues, soil, air, and water, without separating or enriching individual components. In certain aspects of the invention, the methods and devices are useful in performing nucleic acid sequence analysis or nucleic acid quantification including gene expression.

2.2 Background

There are three established DNA sequencing technologies. The dominant sequencing method used today is based on Sanger's dideoxy chain termination process (Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463 (1977), herein incorporated by reference in its entirety) and relies on various gel-based separation instruments ranging from manual systems to fully automated capillary sequencers. The Sanger process is technically difficult and is limited to read lengths of about 1 kb or less, requiring multiple reads to achieve high accuracy. A second method, pyrosequencing, also uses polymerase to generate sequence information by monitoring production of pyrophosphate generated during consecutive cycles in which specific DNA bases are tested for incorporation into the growing chain (Ronaghi, Genome Res. 11:3 (2001), herein incorporated by reference in its entirety). The method provides an elegant multi-well plate assay, but only for local sequencing of very short 10-50 base fragments. This read length restriction represents a serious limitation for sequence-based diagnostics.

Both of the above technologies represent direct sequencing methods in which each base position in a chain is determined sequentially by direct experimentation. Sequencing by hybridization (SBH) (U.S. Pat. No. 5,202,231; Drmanac et al., Genomics 4:114 (1989), both of which are herein incorporated by reference in their entirety) uses the fundamental life chemistry of base-specific hybridization of complementary nucleic acids to indirectly assemble the order of bases in a target DNA. In SBH, overlapping probes of known sequence are hybridized to sample DNA molecules and the resulting hybridization pattern is used to generate the target sequence using computer algorithms (co-owned, co-pending U.S. patent application Ser. No. 09/874,772; Drmanac et al., Science 260:1649-1652 (1993); Drmanac et al., Nat. Biotech. 16:54-58 (1998); Drmanac et al., “Sequencing and Fingerprinting DNA by Hybridization with Oligonucleotide Probes,” In: Encyclopedia of Analytical Chemistry, pp. 5232-5237 (2000); Drmanac et al., “Sequencing by Hybridization (SBH): Advantages, Achievements, and Opportunities,” In: Advances in Biochemical Engineering/Biotechnology: Chip Technology, Hoheisel, J. (Ed.), Vol. 76, pp. 75-98 (2002), all of which are herein incorporated by reference in their entirety). Probes or DNA targets may be arrayed in the form of high-density arrays (see, for example, Cutler et al., Genome Res. 11:1913-1925 (2001), herein incorporated by reference in its entirety). Advantages of the SBH method include experimental simplicity, longer read length, higher accuracy, and multiplex sample analysis in a single assay.

Currently, there is a critical need for new biodefense technologies that can quickly and accurately detect, analyze, and identify all potential pathogens in complex samples. Current pathogen detection technologies generally lack the sensitivity and selectivity to accurately identify trace quantities of pathogens in such samples and are often expensive and difficult to operate. In addition, in their current implementations, all three sequencing technologies require large quantities of sample DNA. Samples are usually prepared by one of several amplification methods, primarily PCR. These methods, especially SBH, can provide good sequence-based diagnostics of individual genes or mixtures of 2-5 genes, although with substantial cost associated with DNA amplification and array preparation. Thus, all current sequencing methods lack the speed and efficiency needed to provide, at acceptable cost, comprehensive sequence-based pathogen diagnostics and screening in complex biological samples. This creates a wide gap between current technical capacity and new sequencing needs. Ideally, a suitable diagnostics process should permit a simultaneous survey of all critical pathogens potentially present in environmental or clinical samples, including mixtures in which engineered pathogens are hidden among organisms.

The requirements for such comprehensive pathogen diagnostics include the need to sequence 10-100 critical genes or entire genomes simultaneously for hundreds of pathogens and to process thousands of samples. Ultimately, this will require sequencing 10-100 Mb of DNA per sample, or 100 Mb to 10 Gb of DNA per day for a lab performing continuous systematic surveys. Current sequencing methods have over 100-fold lower sequencing throughput and 100-fold higher cost than is required for such comprehensive pathogen diagnostics and pre-symptomatic surveys.

Current biosensor technologies use a variety of molecular recognition strategies, including antibodies, nucleic acid probes, aptamers, enzymes, bioreceptors, and other small molecule ligands (Iqbal et al., Biosensors and Bioelectronics 15:549-578 (2000), herein incorporated by reference in its entirety). Molecular recognition elements must be coupled to a reporter molecule or tag to allow positive detection events.

Both DNA hybridization and antibody-based technologies are already widely used in pathogen diagnostics. Nucleic acid-based technologies are generally more specific and sensitive than antibody-based detection, but can be time consuming and less robust (Iqbal et al., 2000, supra). DNA amplification (through PCR or cloning) or signal amplification is generally necessary to achieve reliable signal strength, and accurate prior sequence knowledge is required to construct pathogen-specific probes. Although development of monoclonal antibodies has increased the specificity and reliability of immunoassays, the technology is relatively expensive and prone to false positive signals (Doing et al., J. Clin. Microbiol. 37:1582-1583 (1999); Marks, Clin. Chem. 48:2008-2016 (2002), both of which are herein incorporated by reference in their entirety). Other molecular recognition technologies such as phage display, aptamers and small molecule ligands are still in their early stages of development and not yet versatile enough to address all pathogen detection problems.

The main liability of all current diagnostic technologies is that they lack the sensitivity and versatility to detect and identify all potential pathogens in a sample. Weapons designers can easily engineer new biowarfare agents to foil most pathogen-specific probes or immunoassays. There is a clear, urgent need for efficient sequence-based diagnostics.

To this end, Applicants have developed a high-efficiency genome sequencing system, random DNA array-based sequencing by hybridization (rSBH). rSBH can be useful for genomic sequence analysis of all genomes present in complex microbial communities as well as individual human genome sequencing. rSBH eliminates the need for DNA cloning or DNA separation and reduces the cost of sequencing using methods known in the art.

4. SUMMARY OF THE INVENTION

The present invention provides novel methods, compositions or mixtures and apparatuses capable of analyzing single molecules of DNA to rapidly and accurately sequence any long DNA fragment, mixture of fragments, entire gene, mixture of genes, mixtures of mRNAs, long segments of chromosomes, entire chromosomes, mixtures of chromosomes, entire genome, or mixtures of genomes. Additionally, the present invention provides methods for identifying a nucleic acid sequence within a target nucleic acid. Through consecutive transient hybridizations, accurate and extensive sequence information is obtained from the compiled data. In an exemplary embodiment, a single target molecule is transiently hybridized to a probe or population of probes. After the hybridization ceases to exist with one or more probes, the target molecule again is transiently hybridized to a next probe or population of probes. The probe or population of probes may be identical to those of the previous transient hybridization or they may be different. Compiling a series of consecutive bindings of the same single target molecule with one or more molecules of probe of the same type provides reliable measurements. Thus, because it is consecutively contacted with probes, a single target molecule can provide a sufficient amount of data to identify a sequence within the target molecule. By compiling the data, the nucleic acid sequence of the entire target molecule can be determined.

Further provided by the present invention are methods, compositions and apparatuses for analyzing and detecting pathogens present in complex biological samples at the single organism level and identifying all virulence controlling genes.

The present invention provides a method of analyzing a target molecule comprising the steps of:

-   a) contacting the target molecule with one or more probe molecules in a series of consecutive binding reactions, wherein each association produces an effect on the target molecule or the probe molecule(s); and
-   b) compiling the effects of the series of consecutive binding reactions.

The present invention further provides a method of analyzing a target molecule comprising the steps of:

-   a) contacting the target molecule with one or more probe molecules in a series of consecutive hybridization/dissociation reactions, wherein each association produces an effect on the target molecule or the probe molecule(s); and
-   b) compiling the effects of the series of consecutive hybridization/dissociation reactions.

In certain embodiments, the series contains at least 5, at least 10, at least 25, at least 50, at least 100, or at least 1000 consecutive hybridization/dissociation or binding reactions. In one embodiment, the series contains at least 5 and less than 50 consecutive hybridization/dissociation or binding reactions.

The present invention includes embodiments wherein the probe molecule sequence or structure is known or is determinable. One advantage of such embodiments is that they are useful in identifying a sequence in the target from the compiled effects of the one or more probes of known/determinable sequence. Furthermore, when multiple sequences that overlap have been identified within a target molecule, such identified, overlapping sequences can be used to sequence the target molecule.

The present invention further provides a method of analyzing a target molecule wherein the compilation of effects includes in the analysis a measurement involving time (e.g., the length of time a signal is detected or the detection of signal over a preset time period). In certain embodiments, the effects are compiled by measuring the time that the target molecule(s) or probe molecule(s) produce a fluorescent signal.

Also provided by the present invention are methods wherein the effects are compiled by detecting a signal produced only upon hybridization or binding of the target molecule to a probe. Such methods include those wherein the effects are compiled by determining an amount of a time period that the signal is produced and those wherein the effects are compiled by determining the amount of signal produced. In certain embodiments, the target molecule(s) comprises a fluorescence resonance energy transfer (FRET) donor and the probe molecule(s) comprises a FRET acceptor. In other embodiments, the target molecule(s) comprises a FRET acceptor and the probe molecule(s) comprises a FRET donor.

The invention also provides methods wherein the effect on one or more probes is modification of the probe(s). In certain embodiments, the probes are ligated and the method further comprises detecting the ligated probes. The probes may be labeled with a nanotag.

In embodiments wherein the effect of hybridization or binding on the probe(s) is modification, modifications caused by full-match hybridizations occur more frequently than modifications caused by mismatch hybridizations, and a full match is determinable by detecting the occurrence of a relatively higher number of modifications.
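The counting logic described above can be illustrated with a minimal sketch. It assumes that each detected modification (for example, a ligation event) has already been reduced to the identifier of the probe involved; the function name `call_full_match`, the median baseline, and the `min_ratio` threshold are hypothetical choices made for illustration and are not specified in this disclosure.

```python
from collections import Counter

def call_full_match(events, min_ratio=3.0):
    """Return probes whose modification count stands out from the rest.

    events    : iterable of probe IDs, one entry per detected modification
    min_ratio : hypothetical threshold; a probe is called full-match when its
                count is at least min_ratio times the median count
    """
    counts = Counter(events)
    if not counts:
        return {}
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]  # upper median, used as a mismatch baseline
    return {probe: n for probe, n in counts.items() if n >= min_ratio * median}

# Illustrative data: one probe is modified far more often than its neighbors.
events = ["ACGT"] * 12 + ["ACGA"] * 2 + ["ACGC"] * 3 + ["TCGT"] * 1
full_matches = call_full_match(events)
```

With these illustrative counts only the probe observed 12 times exceeds the threshold; in practice the baseline and threshold would depend on the measured mismatch rate.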

The methods of the present invention include those wherein:

-   a) the target molecule is produced by fragmentation of a nucleic acid molecule;
-   b) the fragmentation is achieved through restriction enzyme digestion, ultrasound treatment, sodium hydroxide treatment, or low pressure shearing;
-   c) the target molecule is detectably labeled;
-   d) the target molecule and/or the probe molecule is detectably labeled with a label selected from the group consisting of a fluorescent label, a nanotag, a chemiluminescent label, a quantum dot, a quantum bead, a fluorescent protein, dendrimers with a fluorescent label, a micro-transponder, an electron donor molecule or molecular structure, and a light reflecting particle;
-   e) the label is detected with a charge-coupled device (CCD);
-   f) probe molecules having the same information region are each associated with the same detectable label;
-   g) one or more probe molecules comprise multiple labels;
-   h) the probe molecules are divided into pools, wherein each pool comprises at least two probe molecules having different information regions, and all probe molecules within each pool are associated with the same label which is unique to the pool as compared with every other pool;
-   i) a sequence of the target molecule is assembled by ordering overlapping probe sequences that hybridize to the target molecule;
-   j) a sequence of the target molecule is assembled by ordering overlapping probe sequences and determining the score/likelihood/probability of the assembled sequence from the hybridization efficiency of the incorporated probes;
-   k) the probes are each independently between 4 and 20 nucleotides in length in the informative region;
-   l) the probes are each independently between 4 and 100 nucleotides in length in the informative region;
-   m) the target sequence of an attached molecule has a length that is between about 20 and 20,000 bases;
-   n) one or more of the probes is comprised of at least one modified or universal base;
-   o) one or more of the probes is comprised of at least one universal base at a terminal position;
-   p) the hybridization conditions are effective to permit hybridization between the target molecule and only those probes that are perfectly complementary to a portion of the target molecule;
-   q) the contacting comprises at least about 10, at least about 100, at least about 1000, or at least about 10,000 probe molecules having informative regions that are distinct from each other; and/or
-   r) fewer than 1000, 800, 600, 400, 200, 100, 75, 50, 25, or 10 target molecules are used.
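Assembly by ordering overlapping probe sequences, as recited in item i) above, can be sketched as follows. This is a simplified illustration, not the assembly algorithm of the referenced bioinformatics applications: it assumes every probe has the same length k (at least 2), that all positive probes are error-free, and that each (k-1)-base overlap is unique. The function name `assemble` and the example target are hypothetical.

```python
def assemble(positive_probes, start):
    """Extend `start` one base at a time using probes that overlap by k-1 bases."""
    probes = set(positive_probes)
    k = len(start)
    sequence = start
    used = {start}
    while True:
        suffix = sequence[-(k - 1):]
        candidates = [p for p in probes if p.startswith(suffix) and p not in used]
        if len(candidates) != 1:  # stop at the end of the target or at an ambiguous branch
            break
        nxt = candidates[0]
        used.add(nxt)
        sequence += nxt[-1]  # each overlapping probe contributes one new base
    return sequence

# Illustrative target and its complete set of positive 4-mer probes.
target = "ATGCGTACCTGA"
k = 4
positives = [target[i:i + k] for i in range(len(target) - k + 1)]
assembled = assemble(positives, start=positives[0])
```

In this toy case every interior base is covered by k overlapping probes, which is the redundancy that allows an assembly to tolerate some incorrectly scored probes; handling branches and scoring candidate assemblies (item j) would require additional logic not shown here.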

In one embodiment, the method of the invention can be used for analyzing the microbial genomes in microbial biofilms and the percent composition thereof. The biofilm community comprises microbes including Leptospirillum ferriphilum phylotype, Ferrospirillum sp., Sulfobacillus thermosulfidooxidans phylotype, archaea (including Ferroplasma acidarmanus, Aplasma, Geneplasma phylotype), and eukaryotes (including protists and fungi).

The invention further provides a method for isothermal amplification using strand displacement enzymes based on the formation of single stranded DNA for primer annealing by an invader oligonucleotide.

The invention further provides software that supports rSBH whole-genome (complex DNA samples) and can process as much as 3 Gbp to 10 Gbp of sequence.

The invention further provides for reagents and kits to simultaneously analyze a plurality of genes or diagnostic regions, and to process and prepare pathogen DNA from blood samples.

The invention further provides for compositions comprising mixtures of probes, target nucleic acids, and ligating molecules to analyze a plurality of pathogen genes or diagnostic regions from blood, tissue, or environmental samples.

Numerous additional aspects and advantages of the invention will become apparent to those skilled in the art upon consideration of the following detailed description of the invention which describes presently preferred embodiments thereof.

5. BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description of the invention may be better understood in conjunction with the accompanying figures as follows:

FIG. 1 depicts adapter ligation and extension. Double stranded hairpin adapters (solid lines) are maintained in the hairpin form by cross-linked bases at the hairpin end. B and F represent bound primer and free primer sequences, respectively, and their complementary sequences are in lower case. Genomic sequence is shown as thin lines. A) Non-phosphorylated adapters are ligated to genomic DNA resulting in nicks in the strand with free 3′ ends (arrowhead). B) Extension from the 3′ end produces a displaced strand and the replication of adapter sequences.

FIG. 2 depicts adapter design and attachment to a DNA fragment, wherein genomic DNA is represented by a solid black bar, F represents the free primer, B represents the bound primer, and f and b represent their complements, respectively.

FIG. 3 depicts ampliot production on the chip surface. A) After melting of the adapter-captured genomic DNA, one strand is captured onto the surface of the slide by hybridization to bound primer B. Polymerase extension from primer B produces a double stranded molecule. B) The template strand is removed by heating and washing of the slide and a free primer F is introduced and extended along the fixed strand. C) Continuous strand displacement amplification by F results in the production of a strand that can move to nearby primer B hybridization sites. D) Displaced strands serve as template for extension from new primer B sites.

FIG. 4 depicts ampliot production using an RNA intermediate. T7 represents the T7 phage RNA polymerase promoter. A) The single stranded adapter region is hybridized to the bound primer B and extended to form a second strand by DNA polymerase resulting in the formation of a double stranded T7 promoter. B) T7 RNA polymerase produces an RNA copy (dashed line). C) The RNA then binds to a nearby primer B and cDNA is produced by reverse transcriptase. Duplex RNA is then destroyed by RNase H.

FIG. 5 depicts a schematic of the invader-mediated isothermal DNA amplification process.

FIG. 6 depicts the random array sequencing by hybridization (rSBH) process. From the top down: (a) A CCD camera is positioned above the reaction platform and a lens is used to magnify and focus on 1 μm² areas from the platform onto individual pixels of the CCD camera. (b) The array (~3 mm × 3 mm) consists of 1 million or more 1 μm² areas, which act as virtual reaction wells (each corresponding to individual pixels of the CCD camera). Each pixel corresponds to the same location on the substrate. In a series of reactions in time, one CCD pixel can combine the data for several reactions, thereby creating the virtual reaction well. DNA samples are randomly digested and arrayed onto the surface of the reaction platform at an average concentration of one fragment per pixel. (c) The array is subjected to rSBH combinatorial ligation using one of several informational probe pools. The signals from each pixel are recorded. (d) Probes from the first pool are removed and the array is subjected to a second round of rSBH combinatorial ligation using a different pool of probes. (e) Insert showing molecular details of fluorescence resonance energy transfer (FRET) signal generation due to ligation of two adjacent and complementary probes whose complement is represented by the target.
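As a rough illustration of what "an average concentration of one fragment per pixel" implies, the sketch below assumes that random attachment follows a Poisson distribution. That distribution is an assumption made here for illustration and is not stated in this disclosure; the variable names are likewise hypothetical.

```python
import math

def poisson_pmf(n, mean):
    """Probability of exactly n fragments in a pixel, given the mean per pixel."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

mean_per_pixel = 1.0                       # average loading stated for the array
empty = poisson_pmf(0, mean_per_pixel)     # pixels with no fragment
single = poisson_pmf(1, mean_per_pixel)    # pixels with exactly one fragment
multiple = 1.0 - empty - single            # pixels with two or more fragments

# A 3 mm x 3 mm array holds 9 x 10^6 areas of 1 um^2 each.
areas = (3 * 1000) * (3 * 1000)
expected_single = single * areas

print(f"empty {empty:.3f}, single {single:.3f}, multiple {multiple:.3f}")
print(f"expected single-fragment pixels: {expected_single:,.0f}")
```

Under this assumption roughly 37% of pixels would hold exactly one fragment, which is one reason an array of this size can still supply on the order of millions of single-molecule targets.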

FIG. 7 depicts the rSBH reaction. The total internal reflection microscopy (TIRM) detection system creates an evanescent field in which enhanced excitation occurs only in the region immediately above the glass substrate. FRET signals are generated when probes are hybridized to the arrayed target and subsequently ligated, thus positioning the FRET pair within the evanescent field. Unligated probes do not give rise to detectable signals, whether they are free in solution or transiently hybridized to the target. Hence, the evanescent field of the TIRM system provides intense signals within a desired plane while reducing background noise from unreacted probes.

FIG. 8 depicts sequence assembly. In general, in the SBH process, the target sequence is assembled using overlapping positive probes. In this process each base is read several times (e.g., 10 times with 10-mer probes), which assures very high accuracy even if some probes are not correctly scored.

FIG. 9 depicts a schematic of a microfluidics device for the rSBH process. The device integrates DNA preparation, formation of random single molecule DNA arrays, combinatorial pool mixing, and cyclic loading and washing of the reaction chamber. When a sample tube is attached to the chip, a series of reactions is performed with pre-loaded reagents to isolate and fragment DNA, which is randomly attached to the array surface at a density of approximately one molecule per pixel. A microfluidics device is then used to mix two probe pools from 5′ and 3′ sets of informative probe pools (IPPs) with the reaction solution. One set of probe pools is labeled with a FRET donor, the other with a FRET acceptor. Mixed pools containing DNA ligase are then transferred to a reaction chamber above the single molecule DNA. Detectable ligation events occur when two probes (one from each pool) hybridize to adjacent complementary sequences of a target DNA molecule within a narrow zone of reflectance (~100 nm) above the array surface. Ligation of 5′ and 3′ probes within the zone of reflectance results in a FRET signal that is detected and scored by an ultra-sensitive CCD camera. After ligation events are scored, each pool mix is removed by a washing solution and a second pair of pools from the same sets of IPPs preloaded on the microfluidics chip is combined and introduced to the reaction chamber. By combining all possible pools within the two sets of IPPs, each target molecule in the array is scored for the presence/absence of every possible combination of probe sequences that exists within the two probe sets.
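The cyclic pairing of pools described for FIG. 9 amounts to visiting every combination of one 5′ pool with one 3′ pool. A minimal sketch of such a schedule follows; the pool names and pool counts are hypothetical placeholders, and the function name `pool_pair_schedule` is introduced only for illustration.

```python
from itertools import product

def pool_pair_schedule(five_prime_pools, three_prime_pools):
    """List every (5' pool, 3' pool) combination, one ligation cycle per pair."""
    return list(product(five_prime_pools, three_prime_pools))

# Hypothetical pool labels; real set sizes are not specified in this passage.
five_prime = ["5p_pool_1", "5p_pool_2", "5p_pool_3"]
three_prime = ["3p_pool_1", "3p_pool_2", "3p_pool_3", "3p_pool_4"]

schedule = pool_pair_schedule(five_prime, three_prime)
for cycle, (donor_pool, acceptor_pool) in enumerate(schedule, start=1):
    # Each cycle would correspond to: mix pools, ligate, image, wash.
    print(cycle, donor_pool, acceptor_pool)
```

The number of cycles is the product of the two set sizes, so the number of pools per set directly sets the number of load, score, and wash rounds per array.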

FIGS. 10A, 10B, and 10C depict the basic optics and light path for the TIRM instrument.

FIG. 10A shows a traditional substrate positioned on top of the prisms and the light path that gives rise to an evanescent field. FIGS. 10B and 10C show the use of galvanometers to control the light path from the laser to the prism assembly.

FIGS. 11A to 11E are schematic representations of rSBH components and processes. FIGS. 11A, 11B, and 11C show the components of the rSBH instrument. FIGS. 11D and 11E show a stepwise description of the experimental process. Sample is collected and prepared (FIG. 11D, Steps 1 and 2) independent of the instrument. Resultant crude sample preparation is further processed for rSBH array formation (FIG. 11D, Step 3) by the sample integration module (FIG. 11A). Targets are subsequently arrayed on the substrate module within the reaction cartridge (FIG. 11B). Samples are subjected to SBH ligation assay (FIG. 11E, Step 4) using SBH probes delivered by the probe module (FIG. 11C). Resultant raw data is processed, resulting in assembly of sequence data (FIG. 11E, Step 5) and interpretive analysis (FIG. 11E, Step 6).

FIGS. 12A, 12B, 12C, and 12D show the full-match ligation signal from four spotted oligonucleotide targets designated Tgt1, Tgt2, Tgt3, and Tgt4, respectively (described in Section 7.7, below). The four different targets were spotted at 7 different concentrations ranging from 1 to 90 μM. Ligation probe concentrations (5′ probe:3′ probe ratio of 1:1) were varied from 0.1 to 1 pmole/20 μl.

FIG. 13 shows a graphic representation of the spotted target serving as a capture probe for another target. The ligation signal was measured when the slide was directly hybridized/ligated with Tgt2-5′ probe and Tgt2-3′ probe (circles) and when the slide was pre-hybridized with target Tgt2-Tgt1-rc and then ligated with Tgt2-5′ probe and Tgt2-3′ probe (squares).

6. DETAILED DESCRIPTION OF THE INVENTION

The present invention provides single molecule DNA analysis methods and devices to rapidly and accurately sequence any long DNA fragment, mixture of fragments, entire genes, mixture of genes, mixtures of mRNAs, long segments of chromosomes, entire chromosomes, mixtures of chromosomes, entire genome, or mixtures of genomes. The method of the invention allows detection of pathogens present in complex biological samples at the single organism level and identification of virulence controlling genes. The method of the invention combines hybridization and especially sequencing by hybridization (SBH) technology with total internal reflection microscopy (TIRM) or other sensitive optical methods using fluorescence, nanoparticles, or electrical methods. The present invention also provides a sample arraying technology which creates virtual reaction chambers that are associated with individual pixels of an ultra-sensitive charge-coupled device (CCD) camera. Using informative pools of complete/universal sets of fluorescent-labeled oligonucleotide probes and a combinatorial ligation process, arrayed genomes are repeatedly interrogated in order to decipher their sequences. Bioinformatics algorithms (co-owned, co-pending U.S. patent application Ser. No. 09/874,772; Drmanac et al., Science 260:1649-1652 (1993); Drmanac et al., Nat. Biotech. 16:54-58 (1998); Drmanac et al., “Sequencing and Fingerprinting DNA by Hybridization with Oligonucleotide Probes,” In: Encyclopedia of Analytical Chemistry, pp. 5232-5237 (2000); Drmanac et al., “Sequencing by Hybridization (SBH): Advantages, Achievements, and Opportunities,” In: Advances in Biochemical Engineering/Biotechnology: Chip Technology, Hoheisel, J. (Ed.), Vol. 76, pp. 75-98 (2002), all of which are herein incorporated by reference in their entirety) are used to transform informative fluorescent signals into assembled sequence data. The device can sequence over 100 megabases of DNA per hour (30,000 bases/sec) using a single compact instrument located in a diagnostic laboratory or small mobile laboratory. Trace quantities of pathogen DNA can be detected, identified and sequenced within complex biological samples using the method of the present invention due to the large capacity of random single molecule arrays. Thus, random array SBH (rSBH) provides the necessary technology to allow DNA sequencing to play an important role in the defense against biowarfare agents, in addition to other sequencing applications.
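The relationship between the hourly and per-second throughput figures quoted above is simple unit conversion, sketched here for reference. The function name is hypothetical; the only inputs are the figures given in the text.

```python
def bases_per_second(megabases_per_hour):
    """Convert a throughput in megabases per hour to bases per second."""
    return megabases_per_hour * 1_000_000 / 3600

rate = bases_per_second(100)       # 100 megabases per hour
print(f"{rate:,.0f} bases/sec")    # on the order of the quoted 30,000 bases/sec
```

At exactly 100 megabases per hour the rate is slightly under 28,000 bases per second, consistent with the rounded figure of 30,000 bases/sec given for throughput "over" 100 megabases per hour.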

The present invention provides a single DNA molecule analysis method to rapidly and accurately detect and identify any pathogen in complex biological mixtures of pathogen, host, and environmental DNA, and analyze any DNA in general, including individual human DNA. The method of the invention allows detection of pathogens present in the sample at the single organism level and identification of all virulence controlling genes. The method of the invention applies the process of combinatorial hybridization/ligation of small sets of universal informative probe pools (IPPs) to random single molecule arrays directly or after in situ amplification of individual arrayed molecules about 10-, 100-, 1000-, or 10,000-fold.

In a typical test, millions of randomly arrayed single DNA molecules obtained from a sample are hybridized with pairs of IPPs representing universal libraries of all possible probe sequences 8 to 10 bases in length. When two probes hybridize to adjacent complementary sequences in target DNAs, they are ligated to create a positive score for that target molecule and the accumulated set of such scores is compiled to assemble the target sequence from overlapping probe sequences.

In another embodiment of the present invention, the signature or sequence of individual targets can be used to assemble longer sequences of entire genes or genomes. In addition, by counting how many times the same molecule or segments from the same gene occur in the array, quantification of gene expression or pathogen DNA may be obtained and such data may be combined with the obtained sequences.
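The positive-score rule described above (two probes score only when they hybridize to immediately adjacent sites on the target) can be sketched as follows. For simplicity the sketch represents each probe by the target-strand sequence it reads rather than by its complement, ignores mismatches and pooling, and uses a short illustrative `probe_len`; the function names and example target are hypothetical.

```python
def ligation_positive_pairs(target, probe_len):
    """Return all (left, right) probe readouts that abut on `target`."""
    pairs = set()
    span = 2 * probe_len
    for i in range(len(target) - span + 1):
        left = target[i:i + probe_len]
        right = target[i + probe_len:i + span]
        pairs.add((left, right))
    return pairs

def is_positive(target, left, right):
    """A pair scores positive only if the two probes are adjacent on the target."""
    return (left + right) in target

target = "ACGTACGGTC"
pairs = ligation_positive_pairs(target, probe_len=2)
```

Each positive pair effectively reads a stretch twice the length of a single probe, and consecutive positive pairs overlap, which is what lets the accumulated scores be compiled into the target sequence.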
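Quantification by counting, as described above, can be sketched under the assumption that each arrayed molecule has already been assigned to a gene or organism from its signature or sequence. The assignment labels and the function name `relative_abundance` are hypothetical.

```python
from collections import Counter

def relative_abundance(assignments):
    """Fraction of arrayed molecules assigned to each gene or organism."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Illustrative assignments for ten arrayed molecules.
assignments = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"] * 1
abundance = relative_abundance(assignments)
```

In this toy example geneA accounts for 60% of the arrayed molecules; with real arrays the counts would also need to account for fragment length and unassigned molecules, which this sketch does not attempt.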

SBH is a well-developed technology that may be practiced by a number of methods known to those skilled in the art. Specifically, the techniques related to sequencing by hybridization discussed in the following documents are incorporated by reference herein in their entirety: Bains and Smith, J. Theor. Biol. 135:303-307 (1988); Beaucage and Caruthers, Tetrahedron Lett. 22:1859-1862 (1981); Broude et al., Proc. Natl. Acad. Sci. USA 91:3072-3076 (1994); Breslauer et al., Proc. Natl. Acad. Sci. USA 83:3746-3750 (1986); Doty et al., Proc. Natl. Acad. Sci. USA 46:461-466 (1960); Chee et al., Science 274:610-614 (1996); Cheng et al., Nat. Biotechnol. 16:541-546 (1998); Dianzani et al., Genomics 11:48-53 (1991); PCT International Patent Application Serial No. WO 95/09248 to Drmanac; PCT International Patent Application Serial No. WO 96/17957 to Drmanac; PCT International Patent Application Serial No. WO 98/31836 to Drmanac; PCT International Patent Application Serial No. WO 99/09217 to Drmanac et al.; PCT International Patent Application Serial No. WO 00/40758 to Drmanac et al.; PCT International Patent Application Serial No. WO 56937; co-owned, co-pending U.S. patent application Ser. No. 09/874,772 to Drmanac and Jin; Drmanac and Crkvenjakov, Scientia Yugoslavica 16:99-107 (1990); Drmanac and Crkvenjakov, Intl. J. Genome Res. 1:59-79 (1992); Drmanac and Drmanac, Meth. Enzymology 303:165-178 (1999); Drmanac et al., U.S. Pat. No. 5,202,231; Drmanac et al., Nucl. Acids Res. 14:4691-4692 (1986); Drmanac et al., Genomics 4:114-128 (1989); Drmanac et al., J. Biomol. Struct. Dyn. 8:1085-1102 (1991); Drmanac et al., “Partial Sequencing by Hybridization: Concept and Applications in Genome Analysis,” in: The First International Conference on Electrophoresis, Supercomputing and the Human Genome, pp. 60-74, World Scientific, Singapore, Malaysia (1991); Drmanac et al., Proceedings of the First Intl. Conf. Electrophoresis, Supercomputing and the Human Genome, Cantor et al. eds, World Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al., Nucl. Acids Res. 19:5839-5842 (1991); Drmanac et al., Electrophoresis 13:566-573 (1992); Drmanac et al., Science 260:1649-1652 (1993); Drmanac et al., DNA and Cell Biol. 9:527-534 (1994); Drmanac et al., Genomics 37:29-40 (1996); Drmanac et al., Nature Biotechnology 16:54-58 (1998); Gunderson et al., Genome Res. 8:1142-1153 (1998); Hacia et al., Nature Genetics 14:441-447 (1996); Hacia et al., Genome Res. 8:1245-1258 (1998); Hoheisel et al., Mol. Gen. 220:903-14:125-132 (1991); Hoheisel et al., Cell 73:109-120 (1993); Holley et al., Science 147:1462-1465 (1965); Housby and Southern, Nucl. Acids Res. 26:4259-4266 (1998); Hunkapiller et al., Science 254:59-63 (1991); Khrapko, FEBS Lett. 256:118-122 (1989); Kozal et al., Nature Medicine 7:753-759 (1996); Labat and Drmanac, “Simulations of Ordering and Sequence Reconstruction of Random DNA Clones Hybridized with a Small Number of Oligomer Probes,” in: The Second International Conference on Electrophoresis, Supercomputing and the Human Genome, pp. 555-565, World Scientific, Singapore, Malaysia (1992); Lehrach et al., Genome Analysis: Genetic and Physical Mapping 1:39-81 (1990), Cold Spring Harbor Laboratory Press; Lysov et al., Dokl. Akad. Nauk. SSSR 303:1508-1511 (1988); Lockhart et al., Nat. Biotechnol. 14:1675-1680 (1996); Maxam and Gilbert, Proc. Natl. Acad. Sci. USA 74:560-564 (1977); Meier et al., Nucl. Acids Res. 26:2216-2223 (1998); Michiels et al., CABIOS 3:203-210 (1987); Milosavljevic et al., Genome Res. 6:132-141 (1996); Milosavljevic et al., Genomics 37:77-86 (1996); Nikiforov et al., Nucl. Acids Res. 22:4167-4175 (1994); Pevzner and Lipschutz, “Towards DNA Sequencing Chips,” in: Mathematical Foundations of Computer Science (1994); Poustka and Lehrach, Trends Genet. 2:174-179 (1986); Privara et al., Eds., pp. 143-158, The Proceedings of the 19^(th) International Symposium, MFCS '94, Kosice, Slovakia, Springer-Verlag, Berlin (1995); Saiki et al., Proc. Natl. Acad. Sci. USA 86:6230-6234 (1989); Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-5467 (1977); Scholler et al., Nucl. Acids Res. 23:3842-3849 (1995); PCT International Application Serial No. WO 89/10977 to Southern; U.S. Pat. No. 5,700,637 to Southern; Southern et al., Genomics 13:1008-1017 (1992); Strezoska et al., Proc. Natl. Acad. Sci. USA 88:10089-10093 (1991); Sugimoto et al., Nucl. Acids Res. 24:4501-4505 (1996); Wallace et al., Nucl. Acids Res. 6:3543-3557 (1979); Wang et al., Science 280:1077-1082 (1998); Wetmur, Crit. Rev. Biochem. Mol. Biol. 26:227-259 (1991).

Advantages of rSBH:

rSBH minimizes or eliminates target-target blocking interactions between two target DNA molecules that are attached at an appropriate distance. The low complexity of DNA sequence (200-2000 bases) per spot reduces the likelihood of inverse repeats that can block each other. Palindromes and hairpin arms are separated in some fragments with one cut per every 20 bases of source DNA on average and attach to non-complementary primer DNA. False positives are minimized because overlapped fragments have different repeated and/or strong mismatch sequences. Probe-probe ligation products are removable by washing. The combination of hybridization/ligation specificity and differential full-match/mismatch stability for the 11-13-mer probes made by ligation has the potential for producing more accurate data. rSBH provides an efficient method of using three-probe ligation in solution, including analysis of short DNA. Pools of patterned probes can be efficiently used on both probe components to provide more informative data. Another advantage is that very low amounts of source DNA are required. The need for standard probe-spot array preparation is eliminated, thereby reducing cost. rSBH provides for multiplex sequencing of up to 1000 samples tagged with different primers or adapters. In addition, the invention provides for detection of a single variant in a pool of up to one million individual samples. Heterozygotes can be detected by counting two variants. The invention provides for 10- to 100,000-fold more information per surface than the standard arrays.

6.1 Preparation and Labeling of Polynucleotides

The practice of the instant invention employs a variety of polynucleotides. Typically some of the polynucleotides are detectably labeled. Species of polynucleotides used in the practice of the invention include target nucleic acids and probes.

The term “probe” refers to a relatively short polynucleotide, preferably DNA. Probes are preferably shorter than the target nucleic acid by at least one base, and more preferably they are 25 bases or fewer in length, still more preferably 20 bases or fewer in length. Of course, the optimal length of a probe will depend on the length of the target nucleic acid being analyzed. In de novo sequencing (no reference sequence used) for a target nucleic acid composed of about 100 or fewer bases, the probes are preferably at least 7-mers; for a target nucleic acid of about 100-200 bases, the probes are preferably at least 8-mers; for a target nucleic acid of about 200-400 bases, the probes are preferably at least 9-mers; for a target nucleic acid of about 400-800 bases, the probes are preferably at least 10-mers; for a target nucleic acid of about 800-1600 bases, the probes are preferably at least 11-mers; for a target nucleic acid of about 1600-3200 bases, the probes are preferably at least 12-mers; for a target nucleic acid of about 3200-6400 bases, the probes are preferably at least 13-mers; and for a target nucleic acid of about 6400-12,800 bases, the probes are preferably at least 14-mers. For every additional two-fold increase in the length of the target nucleic acid, the optimal probe length increases by one base.
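
The length rule set out above (7-mers for targets of about 100 bases or fewer, plus one base for each doubling of target length) may be sketched as follows. This is a minimal illustration only; the function name is hypothetical, and because the stated ranges share their boundary values, the sketch assumes that a target length falling exactly on a boundary takes the shorter probe length.

```python
def min_probe_length(target_length):
    """Preferred minimum informational probe length (in bases) for de novo SBH."""
    if target_length < 1:
        raise ValueError("target_length must be a positive number of bases")
    probe_length = 7      # preferred minimum for targets of about 100 bases or fewer
    upper_bound = 100     # longest target covered by the current probe length
    while target_length > upper_bound:
        upper_bound *= 2  # each two-fold increase in target length...
        probe_length += 1 # ...adds one base to the preferred probe length
    return probe_length
```

Under these assumptions a 3200-base target maps to 12-mers and a 12,800-base target to 14-mers, in agreement with the ranges recited above.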

Those of skill in the art will recognize that for SBH applications utilizing ligated probes, the above-delineated probe lengths are post-ligation. Probes are normally single stranded, although double-stranded probes may be used in some applications.

While typically the probes will be composed of naturally-occurring bases and native phosphodiester backbones, they need not be. For example, the probes may be composed of one or more modified bases, such as 7-deazaguanosine or the universal “M” base, or one or more modified backbone interlinkages, such as a phosphorothioate. The only requirement is that the probes be able to hybridize to the target nucleic acid. A wide variety of modified bases and backbone interlinkages that can be used in conjunction with the present invention are known, and will be apparent to those of skill in the art.

The length of the probes described above refers to the length of the informational content of the probes, not necessarily the actual physical length of the probes. The probes used in SBH frequently contain degenerate ends that do not contribute to the information content of the probes. For example, SBH applications frequently use mixtures of probes of the formula N_(x)B_(y)N_(z), wherein N represents any of the four bases and varies for the polynucleotides in a given mixture, B represents any of the four bases but is the same for each of the polynucleotides in a given mixture, and x, y, and z are integers. Typically, x and z are independent integers between 0 and 5 and y is an integer between 4 and 20. The number of known bases B_(y) defines the “information content” of the polynucleotide, since the degenerate ends do not contribute to the information content of the probes. Linear arrays comprising such mixtures of immobilized polynucleotides are useful in, for example, sequencing by hybridization. Hybridization discrimination of mismatches in these degenerate probe mixtures refers only to the length of the informational content, not the full physical length.
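
To illustrate the distinction between informational and physical length, the following sketch enumerates the physical members of one N_(x)B_(y)N_(z) mixture. The function name and the example core sequence are hypothetical; the sketch assumes only the four natural bases.

```python
from itertools import product

BASES = "ACGT"


def expand_pool(core, x, z):
    """List every physical sequence in the mixture N_x + core + N_z.

    core is the fixed B_y region; x and z are the numbers of degenerate
    positions on the left and right ends, respectively.
    """
    return [
        "".join(left) + core + "".join(right)
        for left in product(BASES, repeat=x)
        for right in product(BASES, repeat=z)
    ]


pool = expand_pool("GATTC", x=1, z=1)
print(len(pool))     # 16 physical 7-mers share one 5-base informational core
print(len(pool[0]))  # 7
```

A mixture of this kind contains 4^(x+z) physical sequences, all of which carry the same y bases of information.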

Probes for use in the instant invention may be prepared by techniques well known in the art, for example by automated synthesis using an Applied Biosystems synthesizer. Alternatively, probes may be prepared using Genosys Biotechnologies Inc. methods using stacks of porous Teflon wafers. For purposes of this invention, the source of oligonucleotide probes used is not critical, and one skilled in the art will recognize that oligonucleotides prepared using other methods currently known or later developed will also suffice.

The term “target nucleic acid” refers to a polynucleotide, or some portion of a polynucleotide, for which sequence information is desired, typically the polynucleotide that is sequenced in the SBH assay. The target nucleic acid can be any number of nucleotides in length, depending on the length of the probes, but is typically on the order of 100, 200, 400, 800, 1600, 3200, 6400, or even more nucleotides in length. A sample typically may have more than 100, more than 1000, more than 10,000, more than 100,000, more than one million, or more than 10 million targets. The target nucleic acid may be composed of ribonucleotides, deoxyribonucleotides, or mixtures thereof. Typically, the target nucleic acid is a DNA. While the target nucleic acid can be double-stranded, it is preferably single stranded. Moreover, the target nucleic acid can be obtained from virtually any source. Depending on its length, it is preferably sheared into fragments of the above-delineated sizes prior to using an SBH assay. Like the probes, the target nucleic acid can be composed of one or more modified bases or backbone interlinkages.

The target nucleic acid may be obtained from any appropriate source, such as cDNAs, genomic DNA, chromosomal DNA, microdissected chromosomal bands, cosmid or yeast artificial chromosome (YAC) inserts, and RNA, including mRNA without any amplification steps. For example, Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Press, NY (1989), herein incorporated by reference in its entirety, describes three protocols for the isolation of high molecular weight DNA from mammalian cells (p. 9.14-9.23).

The polynucleotides would then typically be fragmented by any of the methods known to those of skill in the art including, for example, using restriction enzymes as described at 9.24-9.28 of Sambrook et al. (1989), shearing by ultrasound, and NaOH treatment. A particularly suitable method for fragmenting DNA utilizes the two base recognition endonuclease, CviJI, described by Fitzgerald et al., Nucl. Acids Res. 20:3753-3762 (1992), incorporated herein by reference in its entirety.

In a preferred embodiment, the target nucleic acids are prepared so that they cannot be ligated to each other, for example by treating the fragmented nucleic acids obtained by enzyme digestion or physical shearing with a phosphatase (e.g. calf intestinal phosphatase). Alternatively, nonligatable fragments of the sample nucleic acid may be obtained by using random primers (e.g. N₅-N₉, wherein N=A, G, T, or C), which have no phosphate at their 5′-ends, in a Sanger-dideoxy sequencing reaction with the sample nucleic acid.

In most cases it is important to denature the DNA to yield single stranded pieces available for hybridization. This may be achieved by incubating the DNA solution for 2-5 minutes at 80-90° C. The solution is then cooled quickly to 2° C. to prevent renaturation of the DNA fragments before they are contacted with the probes.

Probes and/or target nucleic acids may be detectably labeled. Virtually any label that produces a detectable signal and that is capable of being immobilized on a substrate or attached to a polynucleotide can be used in conjunction with the arrays of the invention. Preferably, the signal produced is amenable to quantification. Suitable labels include, by way of example and not limitation, radioisotopes, fluorophores, chromophores, chemiluminescent moieties, etc.

Due to their ease of detection, polynucleotides labeled with fluorophores are preferred. Fluorophores suitable for labeling polynucleotides are described, for example, in the Molecular Probes catalog (Molecular Probes, Inc., Eugene, Oreg.), and the references cited therein. Methods for attaching fluorophore labels to polynucleotides are well known, and can be found, for example, in Goodchild, Bioconjug. Chem. 1:165-187 (1990), herein incorporated by reference in its entirety. A preferred fluorophore label is Cy5 dye, which is available from Amersham Biosciences.

Alternatively, the probes or targets may be labeled by any other technique known in the art. Preferred techniques include direct chemical labeling methods and enzymatic labeling methods, such as kinasing and nick-translation. Labeled probes could readily be purchased from a variety of commercial sources, including GENSET, rather than synthesized.

In general, the label can be attached to any part of the probe or target polynucleotide, including the free terminus of one or more of the bases. In preferred embodiments, the label is attached to a terminus of the polynucleotide. The label, when attached to a solid support by means of a polynucleotide, must be located such that it can be released from the solid support by cleavage with a mismatch-specific endonuclease, as described in co-owned, co-pending U.S. patent application Ser. No. 09/825,408 (herein incorporated by reference in its entirety). Preferably, the position of the label will not interfere with hybridization, ligation, cleavage or other post-hybridization modifications of the labeled polynucleotide.

Some embodiments of the invention employ multiplexing, i.e. the use of a plurality of distinguishable labels (such as different fluorophores). Multiplexing allows the simultaneous detection of a plurality of sequences in one hybridization reaction. For example, a multiplex of four colors reduces the number of hybridizations required by an additional factor of four.

Other embodiments employ the use of informative pools of probes to reduce the redundancy normally found in SBH protocols, thereby reducing the number of hybridization reactions needed to unambiguously determine a target DNA sequence. Informative pools of probes and methods of using the same can be found in co-owned, co-pending U.S. patent application Ser. No. 09/479,608, which is incorporated herein by reference in its entirety.

6.2 Attachment of Polynucleotides to a Solid Substrate

Some embodiments of the instant invention require polynucleotides, for example target DNA fragments, to be attached to a solid substrate. In preferred embodiments, the appropriate DNA samples are detectably labeled and randomly attached to a solid substrate at a concentration of 1 fragment per pixel.

The nature and geometry of the solid substrate will depend upon a variety of factors, including, among others, the type of array and the mode of attachment (i.e. covalent or non-covalent). Generally, the substrate can be composed of any material which will permit immobilization of the polynucleotide and which will not melt or otherwise substantially degrade under the conditions used to hybridize and/or denature nucleic acids. In addition, where covalent immobilization is contemplated, the substrate should be activatable with reactive groups capable of forming a covalent bond with the polynucleotide to be immobilized.

A number of materials suitable for use as substrates in the instant invention have been described in the art. In preferred embodiments, the substrate is made of an optically clear substance, such as glass slides. Other exemplary suitable materials include, for example, acrylic, styrene-methyl methacrylate copolymers, ethylene/acrylic acid, acrylonitrile-butadiene-styrene (ABS), ABS/polycarbonate, ABS/polysulfone, ABS/polyvinyl chloride, ethylene propylene, ethylene vinyl acetate (EVA), nitrocellulose, nylons (including nylon 6, nylon 6/6, nylon 6/6-6, nylon 6/10, nylon 6/12, nylon 11, and nylon 12), polyacrylonitrile (PAN), polyacrylate, polycarbonate, polybutylene terephthalate (PBT), polyethylene terephthalate (PET), polyethylene (including low density, linear low density, high density, cross-linked and ultra-high molecular weight grades), polypropylene homopolymer, polypropylene copolymers, polystyrene (including general purpose and high impact grades), polytetrafluoroethylene (PTFE), fluorinated ethylene-propylene (FEP), ethylene-tetrafluoroethylene (ETFE), perfluoroalkoxyethylene (PFA), polyvinyl fluoride (PVF), polyvinylidene fluoride (PVDF), polychlorotrifluoroethylene (PCTFE), polyethylene-chlorotrifluoroethylene (ECTFE), polyvinyl alcohol (PVA), silicon, styrene-acrylonitrile (SAN), styrene maleic anhydride (SMA), metal oxides and glass.

In general, polynucleotide fragments may be bound to a support through appropriate reactive groups. Such groups are well known in the art and include, for example, amino (-NH₂), hydroxyl (-OH), or carboxyl (-COOH) groups. Support-bound polynucleotide fragments may be prepared by any of the methods known to those of skill in the art using any suitable support such as glass. Immobilization can be achieved by many methods, including, for example, using passive adsorption (Inouye and Hondo, J. Clin. Microbiol. 28:1469-1472 (1990), herein incorporated by reference in its entirety), using UV light (Dahlen et al., Mol. Cell Probes 1:159-168 (1987), herein incorporated by reference in its entirety), or by covalent binding of base-modified DNA (Keller et al., Anal. Biochem. 170:441-451 (1988), Keller et al., Anal. Biochem. 177:392-395 (1989), both of which are herein incorporated by reference in their entirety), or by formation of amide groups between the probe and the support (Zhang et al., Nucl. Acids Res. 19:3929-3933 (1991), herein incorporated by reference in its entirety).

It is contemplated that a further suitable method for use with the present invention is that described in PCT Patent Application WO 90/03382 (to Southern et al.), incorporated herein by reference. This method of preparing a polynucleotide fragment bound to a support involves attaching a nucleoside 3′-reagent through the phosphate group by a covalent phosphodiester link to aliphatic hydroxyl groups carried by the support. The oligonucleotide is then synthesized on the supported nucleoside and protecting groups removed from the synthetic oligonucleotide chain under standard conditions that do not cleave the oligonucleotide from the support. Suitable reagents include nucleoside phosphoramidite and nucleoside hydrogen phosphonate.

Alternatively, addressable-laser-activated photodeprotection may be employed in the chemical synthesis of oligonucleotides directly on a glass surface, as described by Fodor et al., Science 251:767-773 (1991), incorporated herein by reference.

One particular way to prepare support-bound polynucleotide fragments is to utilize the light-generated synthesis described by Pease et al., Proc. Natl. Acad. Sci. USA 91:5022-5026 (1994), incorporated herein by reference. These authors used current photolithographic techniques to generate arrays of immobilized oligonucleotide probes, i.e. DNA chips. These methods, in which light is used to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays, utilize photolabile 5′-protected N-acyl-deoxynucleoside phosphoramidites, surface linker chemistry and versatile combinatorial synthesis strategies. A matrix of 256 spatially defined oligonucleotide probes may be generated in this manner and then used in SBH sequencing, as described herein.

In a preferred embodiment, the DNA fragments of the invention are connected to the solid substrate by means of a linker moiety. The linker may be comprised of atoms capable of forming at least two covalent bonds, such as carbon, silicon, oxygen, sulfur, phosphorus, and the like, or may be comprised of molecules capable of forming at least two covalent bonds, such as sugar-phosphate groups, amino acids, peptides, nucleosides, nucleotides, sugars, carbohydrates, aromatic rings, hydrocarbon rings, linear and branched hydrocarbons, and the like. In a particularly preferred embodiment of the invention, the linker moiety is composed of alkylene glycol moieties. In preferred embodiments, a detectable label is attached to the DNA fragment (i.e. target DNA).

6.3 Formation of Detectably Labeled Duplexes on a Solid Support

In one preferred embodiment of the invention, a labeled probe is bound by means of complementary base-pairing interactions to a detectably labeled target nucleic acid that is itself attached to a solid substrate as part of a polynucleotide array, thereby forming a duplex. In another preferred embodiment, a labeled probe is covalently attached, i.e. ligated, to another probe that is bound by means of complementary base-pairing interactions to a target nucleic acid that is itself attached to a solid substrate as part of a spatially-addressable polynucleotide array, if the two probes hybridize to a target nucleic acid in a contiguous fashion.

As used herein, nucleotide bases “match” or are “complementary” if they form a stable duplex or binding pair under specified conditions. The specificity of one base for another is dictated by the availability and orientation of hydrogen bond donors and acceptors on the bases. For example, under conditions commonly employed in hybridization assays, adenine (“A”) matches thymine (“T”), but not guanine (“G”) or cytosine (“C”). Similarly, G matches C, but not A or T. Other bases which interact in less specific fashion, such as inosine or the Universal Base (“M” base, Nichols et al., Nature 369:492-493 (1994), herein incorporated by reference in its entirety), or other modified bases, for example methylated bases, are complementary to those bases with which they form a stable duplex under specified conditions. Nucleotide bases which are not complementary to one another are termed “mismatches.”

A pair of polynucleotides, e.g. a probe and a target nucleic acid, are termed “complementary” or a “match” if, under specified conditions, the nucleic acids hybridize to one another in an interaction mediated by the pairing of complementary nucleotide bases, thereby forming a duplex. A duplex formed between two polynucleotides may include one or more base mismatches. Such a duplex is termed a “mismatched duplex” or heteroduplex. The less stringent the hybridization conditions are, the more likely it is that mismatches will be tolerated and relatively stable mismatched duplexes can be formed.

A subset of matched polynucleotides, termed “perfectly complementary” or “perfectly matched” polynucleotides, is composed of pairs of polynucleotides containing continuous sequences of bases that are complementary to one another and wherein there are no mismatches (i.e. absent any surrounding sequence effects, the duplex formed has the maximal binding energy for the particular nucleic acid sequences). “Perfectly complementary” and “perfect match” are also meant to encompass polynucleotides and duplexes which have analogs or modified nucleotides. A “perfect match” for an analog or modified nucleotide is judged according to a “perfect match rule” selected for that analog or modified nucleotide (e.g. the binding pair that has maximal binding energy for a particular analog or modified nucleotide).

In the case where a pool of probes with degenerate ends of the type N_(x)B_(y)N_(z) is used, as described above, a perfect match encompasses any duplex where the information content regions, i.e. the B_(y) regions, of the probes are perfectly matched. Discrimination against mismatches in the N regions will not affect the results of a hybridization experiment, since such mismatches do not interfere with the information derived from the experiment.
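
The perfect-match rule for the informational region may be illustrated by the following sketch, which locates the sites on a target at which a given B_(y) core pairs without mismatches. The sketch assumes natural A/T and G/C pairing only, assumes that probe and target are both written 5′ to 3′ (so that the probe pairs with the reverse complement of its own sequence), and ignores the degenerate N ends, consistent with the paragraph above. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def core_match_positions(target, core):
    """Start positions on the target where the probe core pairs with no mismatches."""
    site = reverse_complement(core)  # target sequence that pairs with the core
    width = len(site)
    return [
        start
        for start in range(len(target) - width + 1)
        if target[start:start + width] == site
    ]
```

Because only the core is compared, every member of the corresponding N_(x)B_(y)N_(z) mixture is treated as reporting the same site.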

In a particularly preferred embodiment of the invention, a polynucleotide array is provided wherein target DNA fragments are provided on a solid substrate under conditions which permit them to hybridize with at least one set of detectably labeled oligonucleotide probes provided in solution. Both within the sets and between the sets the probes may be of the same length or of different lengths. Guidelines for determining appropriate hybridization conditions can be found in papers such as Drmanac et al. (1990), Khrapko et al. (1991), Broude et al. (1994) (all cited supra) and WO 98/31836, which is incorporated herein by reference in its entirety. These articles teach the ranges of hybridization temperatures, buffers, and washing steps that are appropriate for use in the initial steps of SBH. The probe sets may be applied to the target nucleic acid separately or simultaneously.

Probes that hybridize to contiguous sites on the target nucleic acid are covalently attached to one another, or ligated. Ligation may be implemented by a chemical ligating agent (e.g. water-soluble carbodiimide or cyanogen bromide), by a ligase enzyme, such as the commercially available T4 DNA ligase, by stacking interactions, or by any other means of causing chemical bond formation between the adjacent probes. Guidelines for determining appropriate conditions for ligation can be found in co-owned U.S. patent application Ser. Nos. 09/458,900, 09/479,608, and 10/738,108, all of which are herein incorporated by reference in their entirety.

6.4 Random Array SBH (rSBH)

The method of the present invention uses random array SBH (rSBH) which extends the combinatorial ligation process to single molecule arrays, greatly increasing the sensitivity and power of the method of the invention. rSBH relies on successive interrogations of randomly arrayed DNA fragments by informative pools of labeled oligonucleotides. In the method of the present invention, complex DNA mixtures to be sequenced are displayed on an optically clear surface within the focal plane of a total internal reflection fluorescence microscopy (TIRM) platform and continuously monitored using an ultra-sensitive mega pixel CCD camera. DNA fragments are arranged at a concentration of approximately 1 to 3 molecules per square micron, an area corresponding to a single CCD pixel. TIRM is used to visualize focal and close contacts between the object being studied and the surface to which it is attached. In TIRM, the evanescent field from an internally reflected excitation source selectively excites fluorescent molecules at or near a surface, resulting in very low background scattered light and good signal-to-background contrast. The background and its associated noise can be made low enough to detect single fluorescent molecules under ambient conditions (see Abney et al., Biophys. J. 61:542-552 (1992); Ambrose et al., Cytometry 36:224-231 (1999); Axelrod, Traffic 2:764-774 (2001); Fang and Tan, “Single Molecule Imaging and Interaction Study Using Evanescent Wave Excitation,” American Biotechnology Laboratory (ABL) Application Note, April 2000; Kawano and Enders, “Total Internal Reflection Fluorescence Microscopy,” American Biotechnology Laboratory (ABL) Application Note, December 1999; Reichert and Truskey, J. Cell Sci. 96 (Pt. 2):219-230 (1990), all of which are herein incorporated by reference in their entirety).
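
As a rough illustration of what a loading density of 1 to 3 molecules per pixel implies, the following sketch estimates pixel occupancy. It assumes, solely for purposes of illustration, that random attachment places fragments independently so that the number per pixel follows a Poisson distribution; that statistical model is an assumption and is not stated in the text. The function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k fragments on a pixel under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def occupancy(mean):
    """Return (empty, single, multiple) pixel fractions for a mean loading density."""
    empty = poisson_pmf(0, mean)
    single = poisson_pmf(1, mean)
    return empty, single, 1.0 - empty - single


for mean in (1.0, 2.0, 3.0):
    empty, single, multiple = occupancy(mean)
    print(f"mean={mean}: empty={empty:.3f} single={single:.3f} multiple={multiple:.3f}")
```

Under this assumed model the fraction of empty pixels falls as the mean density rises, while the fraction of pixels carrying more than one fragment increases.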

Using microfluidic technology, pairs of probe pools labeled with donor and acceptor fluorophores are mixed with DNA ligase and presented to the random array. When probes hybridize to adjacent sites on a target fragment, they are ligated together, generating a fluorescence resonance energy transfer (FRET) signal. FRET is a distance-dependent (10-100 Å) interaction between the electronic excited states of two fluorescent molecules in which excitation is transferred from a donor molecule to an acceptor molecule without emission of a photon (Didenko, Biotechniques 31:1106-1121 (2001); Ha, Methods 25:78-86 (2001); Klostermeier and Millar, Biopolymers 61:159-179 (2001-2002), all of which are herein incorporated by reference in their entirety). These signals are detected by the CCD camera, indicating a matching sequence string within that fragment. Once the signals from the first pool are detected, probes are removed and successive cycles are used to test different probe combinations. The entire sequence of each DNA fragment is compiled based on fluorescent signals generated by hundreds of independent hybridization/ligation events.
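
The distance dependence of FRET may be illustrated with the standard Förster relation, E = 1/(1 + (r/R0)^6), which is general background and is not recited in the text. In the sketch below the Förster radius R0 of 50 Å is an assumed, illustrative value that in practice depends on the particular donor and acceptor pair; the function name is hypothetical.

```python
def fret_efficiency(distance_angstrom, forster_radius_angstrom=50.0):
    """Energy transfer efficiency for a donor-acceptor pair at a given separation.

    The default Forster radius of 50 angstroms is an illustrative assumption.
    """
    if distance_angstrom < 0 or forster_radius_angstrom <= 0:
        raise ValueError("distance must be non-negative and radius positive")
    ratio = distance_angstrom / forster_radius_angstrom
    return 1.0 / (1.0 + ratio ** 6)


for r in (10.0, 50.0, 100.0):
    print(f"r = {r:5.1f} angstrom -> E = {fret_efficiency(r):.4f}")
```

With this assumed R0, transfer is nearly complete at 10 Å, half-maximal at 50 Å, and below 2% at 100 Å, which is consistent with the 10-100 Å working range noted above.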

Although only one detectable color will suffice, multiple colors will increase multiplexing of the combinatorics and improve the efficiency of the system. The current state of the art suggests that four colors can be used simultaneously. In addition to traditional direct fluorescence strategies, FRET-based systems, time-resolved systems and time-resolved FRET signaling systems will also be used (Didenko, Biotechniques 31:1106-1121 (2001), herein incorporated by reference in its entirety). New custom chemistries, such as quantum dot enhanced triple FRET systems, may also be used. A weak signal may be overcome using dendrimer technologies and related signal amplification technologies.

Unlike traditional hybridization processes, the method of the present invention relies on a synergistic interaction of hybridization and ligation, in which short probes from two pools are ligated together to generate longer probes with far more informational power. For example, two sets of 1024 five-mer oligonucleotides can be combined to detect over a million possible 10-mer sequence strings. The use of informative probe pools (in which all probes share a common label) greatly simplifies the process, allowing millions of potential probe pairings to occur with only a few hundred pool combinations. Multiple overlapping probes reading consecutive bases allow an accurate determination of DNA sequence from the obtained hybridization pattern. The combinatorial ligation and informative pools technologies described above are augmented by extending their use to single molecule sequencing.
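
The arithmetic behind the foregoing example may be sketched as follows. The number of pools per probe set (16) is a hypothetical value chosen only to yield "a few hundred" pool combinations; the text does not specify a pool size, and the function names are likewise illustrative.

```python
def count_kmers(k):
    """Number of distinct DNA sequences of length k over the four bases."""
    return 4 ** k


def pooled_design(probes_per_set, pools_per_set):
    """Summarize a two-set ligation design in which each set is split into equal pools."""
    if probes_per_set % pools_per_set != 0:
        raise ValueError("pools must divide the probe set evenly")
    per_pool = probes_per_set // pools_per_set
    combinations = pools_per_set ** 2       # one pool from each set per reaction
    pairs_per_combination = per_pool ** 2   # probe pairings tested per reaction
    return combinations, pairs_per_combination, combinations * pairs_per_combination


five_mers = count_kmers(5)
print(five_mers)       # 1024
print(five_mers ** 2)  # 1048576 possible ligated 10-mers
print(pooled_design(five_mers, pools_per_set=16))  # (256, 4096, 1048576)
```

Under the assumed pool size, 256 pool combinations cover all 1,048,576 possible pairings of the two five-mer sets.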

6.5 Structured Random DNA Preparation

A. DNA Isolation and Initial Fragmentation

Cells are lysed and DNA is isolated using basic well-established protocols (Sambrook et al., supra, 1989; Current Protocols in Molecular Biology, Ausubel et al., eds. John Wiley and Sons, Inc., NY, 1999, both of which are herein incorporated by reference in their entirety) or commercial kits [e.g. those available from QIAGEN (Valencia, Calif.) or Promega (Madison, Wis.)]. Critical requirements are: 1) the DNA is free of DNA processing enzymes and contaminating salts; 2) the entire genome is equally represented; and 3) the DNA fragments are between ~5,000 and ~100,000 bp in length. No digestion of the DNA is required because shear forces created during lysis and extraction will give rise to fragments in the desired range. In another embodiment, shorter fragments (1-5 kb) can be generated by enzymatic fragmentation. The input genome number of 10-100 copies will ensure overlap of the entire genome and tolerate poor capture of targets on the array. A further embodiment provides for carrier, circular synthetic double-stranded DNA to be used in the case of small amounts of DNA.

B. DNA Normalization

In some embodiments, normalization of environmental samples may be necessary to reduce the DNA contribution of prevalent species to maximize the total number of distinct species that are sequenced per array. Because rSBH requires as few as 10 genome equivalents, a thorough DNA normalization or subtraction process can be implemented. Normalization can be accomplished using methods commonly used for normalizing cDNA libraries during their production. DNA collected from the sample is divided into two portions, one having ten-fold greater mass than the other. The sample of greater quantity is biotinylated by terminal transferase and ddCTP and attached as a single stranded DNA to a streptavidin column or streptavidin-coated beads. Alternatively, biotinylated random primers may be employed to generate sequence for attachment to streptavidin. Whole genome amplification methodologies (Molecular Staging, Inc., New Haven, Conn.) can also be applied. The sample to be normalized is then hybridized to the attached molecules and those molecules that are over-represented in the sample are preferentially removed from the solution due to the greater number of binding sites. Several hybridization/removal cycles can be applied on the same sample to achieve full normalization. Another embodiment provides for efficient hybridization of long double-stranded DNA fragments without DNA denaturation by generating short terminal regions of single-stranded DNA with a timed lambda exonuclease digestion.

Further embodiments combine DNA normalization and rSBH to sequence low abundance members that are otherwise difficult to analyze. Normalization of one sample against another allows monitoring of changes in community structure and identifying new members as conditions change.

C. Secondary DNA Fragmentation and Adapter Attachment

The present invention provides for long DNA fragments generated by shear forces to be suspended in solution within a chamber located on the glass slide. The concentration of the DNA is adjusted such that the volume occupied by each fragment is on the order of 50×50×50 μm. The reaction chamber comprises a mix of restriction enzymes, T4 DNA ligase, a strand-displacing polymerase, and specially designed adapters. Partial digestion of the DNA by the restriction enzymes yields fragments with an average length of 250 bp with uniform overhang sequences. T4 DNA ligase joins non-phosphorylated double-stranded adapters to the ends of the genomic fragments via complementary sticky ends, resulting in a stable structure of genomic insert with one adapter at each end, but with a nick in one of the strands where the ligase was unable to catalyze the formation of the phosphodiester bond (FIG. 1). T4 DNA ligase is active in most restriction enzyme buffers but requires the addition of ATP and a molar excess of adapters relative to genomic DNA to promote the ligation of adapters at each end of the genomic molecule. Using non-phosphorylated adapters is important to prevent adapter-adapter ligation. Additionally, the adapters contain two primer-binding sites and are held in a hairpin structure by cross-linked bases at the hairpin end that prevent dissociation of the adapters during melting at elevated temperatures. Extension from the 3′-ends with a strand-displacing polymerase such as Vent or Bst results in the production of a DNA strand with adapter sequences at both ends. However, at one end the adapter will be maintained in the hairpin structure, which is useful to prevent association of complementary sequences on the other end of the DNA fragment.
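
By way of illustration only, the stated volume per fragment can be converted to a molar concentration. The following sketch assumes one fragment per 50×50×50 μm cube; the function name is hypothetical and not part of the disclosure.

```python
# Illustrative arithmetic only; names are hypothetical.
AVOGADRO = 6.02214076e23  # molecules per mole


def molar_concentration_for_cube(edge_um: float) -> float:
    """Molar concentration at which each molecule occupies one cube
    with the given edge length (in micrometers)."""
    edge_dm = edge_um * 1e-5      # 1 um = 1e-5 dm, and 1 dm^3 = 1 liter
    volume_liters = edge_dm ** 3
    return 1.0 / (AVOGADRO * volume_liters)


conc = molar_concentration_for_cube(50.0)
print(f"{conc * 1e15:.1f} fM")    # concentration in femtomolar
```

Under this assumption the fragment concentration is on the order of ten femtomolar.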

The invention provides for random DNA arrays to sequence multiple highly similar samples (e.g. individual DNA from patients) in one assay by tagging DNA fragments of each sample prior to random array formation. One or both adapters used for incorporation of primer sequences at the end of the DNA fragments can have a tag cassette. A different tag cassette can be used for each sample. After attaching adapters (preferably by ligation), DNA of all samples is mixed and a single random array is formed. After sequencing of fragments is completed, fragments that belong to each sample are recognized by the assigned tag sequence. Use of the tag approach allows efficient sequencing of a smaller number of targeted DNA regions from about 10-1000 samples on high capacity random arrays having up to about 10 million DNA fragments.
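
The assignment of sequenced fragments to samples by tag can be sketched as follows. This is a minimal illustration that assumes the tag cassette is read as a fixed-length prefix of each fragment sequence; the tag sequences, tag length, and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_len):
    """Group fragment sequences by sample using a fixed-length leading tag.

    Assumption for illustration: the tag is the first `tag_len` bases of
    each read. Reads with an unknown tag are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        by_sample[tag_to_sample.get(tag, "unassigned")].append(insert)
    return dict(by_sample)


# Hypothetical tags and reads
tags = {"ACGT": "patient_1", "TGCA": "patient_2"}
reads = ["ACGTGGGCCC", "TGCAAATTTG", "ACGTCCCAAA", "GGGGTTTTAA"]
print(demultiplex(reads, tags, tag_len=4))
```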

D. DNA Attachment and in situ Amplification

The adapter-linked genomic DNA is then localized with other fragments from the original 5-100 kb fragment onto the glass slide by hybridization to an oligonucleotide that is complementary to the adapter sequences (primer B). After adapter ligation and DNA extension, the solution is heated to denature the molecules which, when in contact with a high concentration of primer oligonucleotides attached to the surface of the slide, hybridize to these complementary sequences during the re-annealing phase. In an alternative embodiment, in situ amplification does not occur and the adapter is attached to the support and the DNA fragments are ligated. Most of the DNA structures that arise from the one parent molecule are localized to one section of the slide on the order of 50×50 μm; therefore if 1000 molecules are generated from the restriction digest of one parent molecule, each fragment will occupy, on average, a 1-4 μm² region. Such a 1-4 μm² region can be observed by a single pixel of a CCD camera and represents a virtual reaction well within an array of one million wells.
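
The area estimate above follows from simple division, shown here as an illustrative check (the function name is hypothetical):

```python
def area_per_fragment_um2(region_edge_um: float, n_fragments: int) -> float:
    """Average surface area available to each fragment when n_fragments
    are spread over a square region with the given edge length."""
    return region_edge_um ** 2 / n_fragments


# 1000 fragments from one parent molecule spread over 50 x 50 um
area = area_per_fragment_um2(50.0, 1000)
print(area)  # 2.5 square micrometers, inside the stated 1-4 um^2 range
```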

Lateral diffusion of DNA fragments more than 50 μm across the slide surface is unlikely to be significant in the short period of time in a 50-100 μm thick capillary chamber that prevents liquid turbulence. In addition, high viscosity buffers or gel can be used to minimize diffusion. In yet a further embodiment, limited turbulence is needed to spread hundreds of short DNA fragments derived from single 5-100 kb molecules over a 50×50 μm surface. Note that the spreading does not have to be perfect because SBH can analyze mixtures of a few DNA fragments at the same pixel location. A few fractions of the original sample with more uniform fragment length (i.e. 5-10 kb, 10-20 kb, 20-40 kb, 40-1000 kb) may be prepared to achieve equal spacing between short fragments. Furthermore, an electric field can be used to pool short DNA fragments to the surface for attachment. Partially structured arrays with local mixing of short fragments are almost as efficient as fully structured arrays because no short fragments from any single, long fragment are mixed with short fragments generated from about 10,000 other long initial fragments.

A further embodiment of the invention provides for a ligation process that attaches two primer sequences to DNA fragments. This approach is based on targeting single stranded DNA produced by denaturation of double-stranded DNA fragments. Because single-stranded DNA has unique 5′ and 3′ ends, specific primer sequences can be attached to each end. Two specific adapters, each comprising two oligonucleotides, are designed (see FIG. 2) that have specifically modified ends, wherein F and B represent unbound, solution-free primer (F) and surface-bound (B) primer sequences and f and b represent sequences complementary to these primer sequences (i.e. primer f is complementary to primer F). The only 3′-OH group that is necessary for ligation to the DNA fragment is on primer F; the other oligonucleotides can have a dideoxy 3′ end (dd) to prevent adapter-adapter ligation. In addition to the 5′-phosphate group (P) present on primer b, primer B may also have a 5′-P group to be used for degradation of this primer after adapter ligation to expose primer b sequence for hybridization to the surface-attached primer/capture probe B. To allow for adapter ligation to any DNA fragment generated from the source DNA by random fragmentation, the oligonucleotides f and B have several (approximately 3-9, preferably 5-7) degenerate bases (Ns).

Although rSBH detection is designed for single molecule detection, some embodiments amplify each DNA target in situ. The method of the invention provides for isothermal, exponential amplification within a micron-sized spot of localized amplicon, herein denoted as “ampliot” (defined to be an amplicon spot) (FIG. 3). The amplification is achieved by use of a primer bound to the surface (primer B) and one free primer in solution (primer F). Primer B first hybridizes to the original target sequence and is extended, copying the target sequence. The non-attached strand is melted and washed away and new reagent components are added, including a DNA polymerase with strand-displacing properties (such as Bst DNA polymerase), dNTPs, and primer F. A continuous amplification reaction is then used to synthesize a new strand and displace the previously synthesized complement.

The continuous exponential amplification reaction produces a displaced strand, which contains complementary sequences to the capture array oligonucleotide and thus, in turn, is captured and used as a template for further amplification. This process of strand displacement requires that the primer is able to continuously initiate polymerization. Several such strategies are described in the art, such as ICAN™ technology (Takara BioEurope, Gennevilliers, France) and SPIA technology (NuGEN, San Carlos, Calif.; U.S. Pat. No. 6,251,639, herein incorporated by reference in its entirety). The property of RNase H that degrades RNA in an RNA/DNA duplex is utilized to remove the primer once extension has been initiated, allowing another primer to hybridize and initiate polymerization and strand displacement. In a preferred embodiment, a primer F site is designed in the adapter to be A/T rich such that double-stranded DNA has the ability to frequently denature and allow binding of the F primer at the temperature optimal for the selected DNA polymerase. Approximately 100 to 1000 copies in the ampliot are generated through a continuous exponential amplification without the need for thermocycling.

Yet a further embodiment of the invention incorporates the T7 promoter into the adapter and synthesizes RNA as an intermediate (FIG. 4). Double-stranded DNA is first generated on the slide surface using a nick translating or strand-displacing polymerase. The newly formed strand acts as the template for T7 polymerase and also forms the necessary double-stranded promoter by extension from primer B. Transcription from the promoter produces RNA strands that can hybridize to nearby surface-bound primer, which in turn can be reverse transcribed with reverse transcriptase. This linear amplification process can produce 100-1000 target copies. The cDNA produced can then be converted to single stranded DNA by degradation of the RNA strand in the RNA/DNA duplex with RNase H or by alkali and heat treatment. To minimize intramolecular hybridization of primer B sequence in the RNA molecule, half of the sequence of primer B can come from the T7 promoter sequence, thus reducing the amount of complementary sequences generated to around ten bases.

Both amplification methods are isothermal, which limits diffusion of the synthesized strands to within the ampliot region. The ampliot size is about 2 μm, but it can be up to 10 μm because amplified DNA signal can offset a 25-fold increase in total surface background per CCD pixel. Furthermore, primer B attachment sites are spaced about 10 nm apart (about 10,000 sites per μm²), providing immediate capture of the displaced DNA. Buffer turbulence is almost eliminated by the enclosed capillary reaction chamber.
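
The two figures in this paragraph can be checked with a short calculation. The sketch below assumes attachment sites on a square grid and background that scales with spot area; both are simplifying assumptions, and the function names are hypothetical.

```python
def sites_per_um2(spacing_nm: float) -> float:
    """Attachment sites per square micrometer for a square grid
    with the given site-to-site spacing in nanometers."""
    sites_per_edge = 1000.0 / spacing_nm   # 1 um = 1000 nm
    return sites_per_edge ** 2


def background_ratio(large_um: float, small_um: float) -> float:
    """Ratio of surface areas (and, by assumption, of surface
    background) for two spot diameters."""
    return (large_um / small_um) ** 2


print(sites_per_um2(10.0))          # 10000.0 sites per um^2
print(background_ratio(10.0, 2.0))  # 25.0-fold increase in area
```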

Yet a further embodiment of the invention provides a method for isothermal amplification using strand displacement enzymes based on the formation of single-stranded DNA for primer annealing by an invader oligonucleotide (see FIG. 5). Double stranded DNA can be amplified at a constant temperature using two primers, one invader oligonucleotide or other agent, and strand displacement polymerases, such as Klenow fragment polymerase. The invader oligonucleotide is in equal or higher concentration relative to the corresponding primer(s). The target DNA is initially about 100 to 100 million-fold or less concentrated than the primers.

The method of isothermal amplification using an invader oligonucleotide comprises the steps of:

1) Binding of the invader (that can be prepared in part from LNA or PNA or other modifications that provide stronger binding to DNA) to one of the 5′-end sequences of the target DNA by an invasion process. The invader can have a single-stranded or double-stranded overhang (Ds). Invasion can be helped by low duplex stability of (TA)_(x) or similar sequences that can be added to the corresponding end of the target DNA via an adapter.

2) Hybridizing of primer 1 to the available single stranded DNA site and initiation of primer extension and displacement of one DNA strand by the polymerase. The invader is partially complementary to primer 1. To avoid complete blocking of the primer, the size and binding efficiency of the complementary portion are designed to provide a bound/unbound equilibrium of about 9:1 at the temperature and concentrations used. Approximately 10% of the free primer 1 is in excess over the target DNA.

3) Hybridizing of primer 2 to the opposite end of the single stranded DNA and creation of a new double stranded DNA by the polymerase.

4) Repeating steps 1-3 due to continuous initiation of steps 1-3 by the initial and new dsDNA molecules.

E. Probes and Pools Design

One or more detectable colors can be used; however, multiple colors would reduce the number of ligation cycles and improve the efficiency of the system. The current state of the art suggests that four colors can be used simultaneously. The preferred embodiment of the invention utilizes FRET-based systems, time-resolved systems and time-resolved FRET signaling systems (Didenko, 2001, supra). Custom chemistries, such as quantum dot enhanced triple FRET systems, as well as dendrimer technologies are also contemplated.

Two sets of universal probes for FRET-based detection are used in the preferred embodiment. Using the probe design previously described in co-owned U.S. patent application Ser. Nos. 09/479,608 and 10/608,298 (herein incorporated by reference in their entirety), all 4096 possible hexamers are produced with 1024 or fewer individual syntheses. Probes are subjected to the matriculation and QC (quality control) processing protocols (Callida Genomics, Inc., Sunnyvale, Calif.) prior to use in experiments. Probes are designed to have minimal efficiency difference, and the actual behavior of each probe with full-match and mismatch targets is determined by the QC assays and used by an advanced base-calling system (Callida Genomics, Inc.).

6.6 Core Technologies

The method of the present invention relies on three core technologies: 1) universal probes, which allow complete sequencing by hybridization of DNA from any organism and detection of any possible sequence alteration. These probes are designed using statistical principles without referring to a known gene sequence (see co-owned, co-pending U.S. patent application Ser. No. 10/608,293, herein incorporated by reference in its entirety); 2) combinatorial ligation, in which two small universal sets of short probes are combined to produce tens of thousands of long probe sequences with superior specificity provided by “enzymatic proofreading” by DNA ligase (see U.S. patent application Ser. No. 10/608,293); 3) informative probe pools (IPPs), mixtures of hundreds of identically tagged probes of different sequences that simplify the hybridization process without negative impact on sequence determination (see U.S. patent application Ser. No. 09/479,608, herein incorporated by reference in its entirety).

The method of the present invention uses millions of single molecule DNA fragments, randomly arrayed on an optically clear surface, as templates for hybridization/ligation of fluorescently tagged probe pairs from IPPs. A sensitive mega pixel CCD camera with advanced optics is used to simultaneously detect millions of these individual hybridization/ligation events on the entire array (FIG. 6). DNA fragments (25 to 1500 bp in length) are arrayed at a density of about 1 molecule per CCD pixel (1 to 10 molecules per square micron of substrate). Each CCD pixel defines a virtual reaction cell of about 0.3 to 1 μm containing one (or a few) DNA fragments and hundreds of labeled probe molecules. The ability of SBH to analyze mixtures of samples and assemble sequences of each included fragment is of great benefit for random arrays. DNA density can be adjusted to have 1-3 fragments that can be efficiently analyzed in more than 90% of all pixels. The volume of each reaction is about 1-10 femtoliters. A 3×3 mm array has the capacity to hold 100 million fragments or approximately 100 billion DNA bases (the equivalent of 30 human genomes).

6.7 Combinatorial SBH

As described above, standard SBH has significant advantages over competing gel-based sequencing technologies, including improvements in sample read length. Ultimately, however, standard SBH processes are limited by the need to use exponentially larger probe sets to sequence longer and longer DNA targets.

Combinatorial SBH overcomes many of the limitations of standard SBH techniques. In combinatorial SBH (U.S. Pat. No. 6,401,267 to Drmanac, herein incorporated by reference in its entirety), two complete, universal sets of short probes are exposed to target DNA in the presence of DNA ligase. Typically, one probe set is attached to a solid support such as a glass slide, while the other set, labeled with a fluorophore, is free in solution (FIGS. 6 and 7). When attached and labeled probes hybridize to the target at precisely adjacent positions, they are ligated, generating a long, labeled probe that is covalently linked to the surface. After washing to remove the target and unattached probes, fluorescent signals at each array position are scored by a standard array reader. A positive signal at a given position indicates the presence of a sequence within the target that complements the two probes that were combined to generate the signal. Combinatorial SBH has enormous read length, cost and material advantages over standard SBH methods. For example, in standard SBH a full set of over a million 10-mer probes is required to accurately sequence (for purposes of mutation discovery) a DNA target of length 10-100 kb. In contrast, with combinatorial SBH, the same set of 10-mers is generated by combining two small sets of 1024 5-mers. By greatly reducing experimental complexity, costs and material requirements, combinatorial SBH allows dramatic improvements in DNA read length and sequencing efficiency.
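
The probe counts in the example above follow from the number of possible sequences of a given length over a four-letter alphabet. The following is an illustrative check only; the function name is hypothetical.

```python
def universal_set_size(k: int) -> int:
    """Number of distinct DNA oligonucleotides of length k (4 bases)."""
    return 4 ** k


standard_10mers = universal_set_size(10)        # probes synthesized in standard SBH
combinatorial_5mers = 2 * universal_set_size(5)  # two sets of 5-mers

print(standard_10mers)      # 1048576, i.e. "over a million"
print(combinatorial_5mers)  # 2048, i.e. two sets of 1024
print(standard_10mers // combinatorial_5mers)  # 512-fold fewer syntheses
```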

6.8 Informative Probe Pools

The efficiencies of combinatorial SBH are further amplified by the use of informative probe pools (IPPs). IPPs are statistically selected sets of probes that are pooled during the hybridization process to minimize the number of combinations that must be tested. A set of IPPs, containing from 4 to 64 different pools, is designed to unambiguously determine any given target sequence. Each pool set comprises a universal set of probes. Pools typically range in size from 16 to 256 probes. When a positive signal results from one or more of these probes, all probes in the pool receive a positive score. The scores from any independent IPP pairings are used to generate a combined probability score for each base position. Accurate sequence data is virtually certain because scores for ten or more overlapping probes, each in different pools, are combined to generate the score for each base position. A false positive score for one probe is easily offset by the correct scores of many others from different pools. In addition, sequencing complementary DNA strands independently minimizes the impact of pool-related false positive probes because the real positive probes for each complementary strand tend to fall, by chance, in different pools. IPPs of longer probes are actually more informative and provide more accurate data than individually scored shorter probes. For example, 16,000 pools of 64 10-mers provide 100-fold fewer false positives than 16,000 individual 7-mers for a 2 kb DNA fragment.
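
The pool-level scoring rule (a pool is positive when any member probe matches the target) can be sketched as follows. This toy example assigns probes to pools round-robin, which is an assumption made for brevity; the pools described herein are statistically selected, and the function names are hypothetical.

```python
from itertools import product


def all_kmers(k):
    """All 4**k DNA sequences of length k, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def make_pools(probes, n_pools):
    """Round-robin pooling (illustrative assumption, not the
    statistical selection described in the text)."""
    pools = [[] for _ in range(n_pools)]
    for i, probe in enumerate(probes):
        pools[i % n_pools].append(probe)
    return pools


def score_pools(pools, target):
    """A pool scores positive if any of its probes occurs in the target."""
    k = len(pools[0][0])
    present = {target[i:i + k] for i in range(len(target) - k + 1)}
    return [any(p in present for p in pool) for pool in pools]


pools = make_pools(all_kmers(4), 16)   # 256 probes in 16 pools of 16
scores = score_pools(pools, "ACGTAC")
print([i for i, s in enumerate(scores) if s])  # indices of positive pools

# Pool-count arithmetic from the text: pooling all 10-mers 64 at a time
# gives as many pools as there are individual 7-mers (about 16,000).
print(4 ** 10 // 64, 4 ** 7)
```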

Sets of IPPs will be used to acquire sequence information from arrayed DNA targets. IPPs are carefully selected pools of oligonucleotides of a given length, with each pool typically containing 16 to 128 individual probes. All possible oligos of that length are represented at least once in each set of IPPs. One set of IPPs is labeled with donor fluorophores, the other set is labeled with acceptor fluorophores. These act together to generate FRET signals when ligation between probes from donor and acceptor sets occurs. Such ligation events occur only when the two probes hybridize simultaneously to adjacent complementary sites on a target, thus identifying an 8-10 base long complementary sequence within it. The length of DNA that can be analyzed per pixel is a function of probe length, pool size, and number of pairs of probe pools tested, and typically ranges from 20 to 1500 bp. By increasing the number of pools and/or probes, several kilobases of target DNA can be sequenced. Partial sequencing and/or signature analysis of 1-10 kb of DNA fragments can be accomplished using small subsets of IPPs or even individual probe pairs. IPP pairs may be tested in consecutive hybridization cycles or simultaneously, if multiplex fluorescent labels are used. The fixed position of the CCD camera relative to the array ensures accurate tracking of consecutive hybridizations to individual target molecules.

IPPs are designed to promote strong FRET signals and sequence-specific ligation. Typical probe design includes 5′-F_(x)-N₁₋₄-B₄₋₅-OH-3′ for the first set of IPPs and 5′-P-B₄₋₅-N₁₋₄-F_(y)-3′ for the second set, wherein F_(x) and F_(y) are donor and acceptor fluorophores, B_(n) are specific (informatic) bases, and N_(n) are degenerate (randomly mixed) bases. The presence of degenerate bases increases the effective probe length without increasing experimental complexity. Each probe set requires synthesizing 256 to 1024 probes and then mixing them to create pools of 16 or more probes per pool, for a total of 8 to 64 IPPs per set. Individual probes may be present in one or more pools as needed to maximize experimental sensitivity, flexibility, and redundancy. Pools from the donor set are hybridized to the array sequentially with pools from the acceptor set in the presence of DNA ligase. Once each pool from the donor set has been paired with each pool from the acceptor set, all possible combinations of 8-10 base informatic sequences have been scored, thus identifying the complementary sequences within the target molecules at each pixel. The power of the technique is that two small sets of synthetic oligonucleotide probes are used combinatorially to create and score potentially millions of longer sequence strings.

The precise biochemistry of the process relies on sequence-specific hybridization and enzymatic ligation of two short oligonucleotides using individual DNA target molecules as templates. Although only a single target molecule is interrogated per pixel at any moment, hundreds of probe molecules of the same sequence will be available to each target for fast consecutive interrogations to provide statistical significance of the measurements. The enzymatic efficiency of the ligation process combined with the optimized reaction conditions provides fast multiple interrogation of the same single target molecule. Under relatively high probe concentrations and high reaction temperatures, individual probes hybridize quickly (within 2 seconds) but dissociate even more rapidly (about 0.5 seconds) unless they are ligated. In contrast, ligated probes remain hybridized to the target for approximately 4 seconds at optimized temperatures, continuously generating FRET signals that are detected by the CCD camera. By monitoring each pixel for 60 seconds at 1-10 image frames per second, on average 10 consecutive ligation events will occur at the matching target sequences, generating a light signal at that position for about 40 of the 60 seconds. In the case of mismatched targets, ligation efficiency is about 30 fold lower, thus rarely generating ligation events and producing little or no signal during the 60 second reaction time.
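
The signal-time estimate above can be reproduced with simple arithmetic. The sketch below additionally assumes that the expected number of mismatch ligation events scales linearly with the 30-fold lower ligation efficiency; that scaling and all names are illustrative assumptions.

```python
def signal_seconds(n_events: float, dwell_s: float) -> float:
    """Total time a pixel emits signal, given the expected number of
    ligation events and the dwell time of each ligated probe pair."""
    return n_events * dwell_s


WINDOW_S = 60.0   # monitoring time per pixel
DWELL_S = 4.0     # ligated probes stay hybridized about 4 s

match_signal = signal_seconds(10, DWELL_S)
# Assumption: mismatch events are 30-fold rarer than match events.
mismatch_signal = signal_seconds(10 / 30, DWELL_S)

print(match_signal, match_signal / WINDOW_S)   # 40.0 s, about two thirds of the window
print(round(mismatch_signal, 2))               # about 1.33 s
print(int(WINDOW_S * 1), int(WINDOW_S * 10))   # 60 to 600 frames per pixel
```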

The main detection challenge is minimization of background signal, which may result from the required excess of labeled probe molecules. Besides focusing CCD pixels on the smallest possible substrate area, our primary solution to this problem relies on a synergistic combination of surface proximity and the FRET technique (FIG. 7). Long-lasting excitation of the reporting label on one probe will occur only when a pair of probes is aligned on the same target molecule at close proximity to the illuminated surface (for example within a 100 nm wide evanescent field generated by total internal reflection). Thus, background signal will not be generated from excess non-hybridized probes in solution, since either the donor will be too far from the surface to be illuminated, or the acceptor will be too far from the donor to cause energy transfer. In addition, probe molecules can be tagged with multiple dye molecules (attached by branched dendrimers) to increase probe signal over general system background.

After all IPPs are tested, sequence assembly of individual molecules will be performed using SBH algorithms and software (co-owned, co-pending U.S. patent application Ser. No. 09/874,772; Drmanac et al., Science 260:1649-1652 (1993); Drmanac et al., Electrophoresis 13:566-573 (1992); Drmanac et al., J. Biomol. Struct. Dyn. 8:1085-1102 (1991); Drmanac et al., Genomics 4:114-128 (1989); U.S. Pat. Nos. 5,202,231 and 5,525,464 to Drmanac et al., all of which are herein incorporated by reference in their entirety). These advanced statistical procedures define the sequence that matches the ligation data with the highest likelihood. The light intensities measured by the CCD camera are treated as probabilities that full-match sequences for the given probe pairs exist at that pixel/target site. Because several positive overlapping probes from different pools independently “read” each base in the correct sequence (FIG. 8), the combined probability of these probes provides accurate base determination even if a few probes fail. Conversely, multiple independent probes corresponding to incorrect sequences fail to hybridize with the target, giving a low combined probability for that sequence. This occurs even if a few probes corresponding to the incorrect sequence appear positive because they happen to be present in an IPP having a true positive probe matching the real sequence.
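
The idea of combining the probabilities of overlapping probes can be illustrated with a toy scoring function. This is not the statistical procedure of the cited references; it is a minimal sketch that assumes independent probes and sums log-probabilities, with a hypothetical floor value for probes that gave no signal.

```python
import math


def sequence_log_likelihood(candidate, probe_prob, k, floor=1e-3):
    """Sum of log-probabilities of every overlapping k-mer probe that
    the candidate sequence implies. Probes absent from `probe_prob`
    (no signal) get the `floor` probability. Illustrative only."""
    total = 0.0
    for i in range(len(candidate) - k + 1):
        p = probe_prob.get(candidate[i:i + k], floor)
        total += math.log(max(p, floor))
    return total


# Hypothetical normalized intensities: three true probes and one
# pool-related false positive ("CTT").
probe_prob = {"ACG": 0.9, "CGT": 0.9, "GTA": 0.9, "CTT": 0.8}

correct = sequence_log_likelihood("ACGTA", probe_prob, k=3)
incorrect = sequence_log_likelihood("ACTTA", probe_prob, k=3)
print(correct > incorrect)  # True: one false positive cannot rescue the wrong sequence
```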

6.9 The rSBH Process

The core of the rSBH process of the invention involves the creation and analysis of high-density random arrays containing millions of genomic DNA fragments. Such random arrays eliminate the costly, time-consuming steps of arraying probes on the substrate surface and the need for individual preparation of thousands of sequencing templates. Instead, they provide a fast and cost-effective way to analyze complex DNA mixtures containing 10 Mb to 10 Gb in a single assay.

The rSBH process of the invention combines the advantages of: 1) combinatorial probe ligation of two IPPs in solution to generate sequence-specific FRET signals; 2) the accuracy, long read length, and ability of the combinatorial method to analyze DNA mixtures in one assay; 3) TIRM, a highly sensitive low background fluorescence detection process; and 4) a commercial mega-pixel CCD camera with single photon sensitivity. The method of the invention provides the ability to detect ligation events on single target molecules because long lasting signals are generated only when two ligated probes hybridize to the attached target, bringing donor and acceptor fluorophores to within 6-8 nm of each other and within the 500 nm wide evanescent field generated at the array surface.

The method of the invention typically uses thousands to millions of single molecule DNA fragments, randomly arrayed on an optically clear surface, which serve as templates for hybridization/ligation of fluorescently tagged probe pairs from IPPs (FIG. 6). Pairs of probe pools labeled with donor and acceptor fluorophores are mixed with DNA ligase and presented to the random array. When probes hybridize to adjacent sites on a target fragment, they are ligated together, generating a FRET signal. A sensitive mega pixel CCD camera with advanced optics is used to simultaneously detect millions of these individual hybridization/ligation events on an entire array. Each matching sequence is likely to generate several independent hybridization/ligation events, since ligated probe pairs eventually diffuse away from the target and are replaced by newly hybridizing donor and acceptor probes. Non-ligated pairs that hybridize near one another may momentarily generate FRET signal, but do not remain bound to the target long enough to generate significant signal.

Once signals from the first pool are detected, the probes are removed and successive ligation cycles are used to test different probe combinations. The fixed position of the CCD camera relative to the array ensures accurate tracking across consecutive tests. Testing 256 pairs of IPPs (16×16 IPPs) takes 2-8 hours. The entire sequence of each DNA fragment is compiled based on fluorescent signals generated by hundreds of independent hybridization/ligation events.
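
As an illustrative check of the timing, the stated total run time can be divided by the number of pool pairs to give the time budget per ligation cycle, including probe introduction, imaging, and washing. The function name is hypothetical.

```python
def seconds_per_pair(total_hours: float, n_pairs: int) -> float:
    """Average time available per IPP pair for a given total run time."""
    return total_hours * 3600.0 / n_pairs


n_pairs = 16 * 16                      # every donor pool with every acceptor pool
fastest = seconds_per_pair(2, n_pairs)
slowest = seconds_per_pair(8, n_pairs)
print(n_pairs, fastest, slowest)       # 256 pairs, 28.125 s to 112.5 s per cycle
```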

DNA fragments (50-1500 bp in length) are arrayed at a density of about 1 molecule per square micron of substrate. Each CCD pixel defines a virtual reaction cell of about 1×1 to 3×3 microns containing one (or a few) DNA fragments and hundreds of labeled probe molecules. The method of the present invention effectively uses the ability of SBH to analyze mixtures of samples and assemble sequences for each fragment in the mix. The volume of each reaction is about 1-10 femtoliters. A 3×3 mm array has the capacity to hold 1-10 million fragments, or approximately 1-10 billion DNA bases, the upper limit being the equivalent of three human genomes.
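
The capacity figures can be reproduced from the array area and fragment density. The sketch below assumes a mean fragment length of 1000 bp and a human genome size of about 3.2 billion bases; both values are assumptions introduced for this illustration.

```python
def array_capacity(edge_mm: float, molecules_per_um2: float) -> float:
    """Number of fragments on a square array of the given edge length."""
    edge_um = edge_mm * 1000.0
    return edge_um ** 2 * molecules_per_um2


MEAN_FRAGMENT_BP = 1000    # assumed mean length within the 50-1500 bp range
HUMAN_GENOME_BP = 3.2e9    # assumed approximate human genome size

fragments = array_capacity(3.0, 1.0)
bases = fragments * MEAN_FRAGMENT_BP
print(fragments)                 # 9,000,000 fragments
print(bases)                     # 9 billion bases
print(bases / HUMAN_GENOME_BP)   # roughly three human genomes
```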

The length of DNA fragments that can be analyzed per pixel is a function of probe length, pool size, and number of pairs of probe pools tested, and typically ranges from 50 to 1500 bp. By increasing the number of pools and/or probes, several kilobase DNA targets can be sequenced. Partial sequencing and/or signature analysis of 1-10 kb DNA fragments can be accomplished using small subsets of IPPs, or even individual probe pairs.

The rSBH method of the invention preserves all the advantages of combinatorial SBH including the high specificity of the ligation process. At the same time, it adds several important benefits that result from the attachment of DNA fragments instead of probes. DNA attachment creates the possibility of using random DNA arrays with much greater capacity than regular probe arrays and allows FRET detection by ligation of two labeled probes in solution. In addition, having both probe modules in solution allows expansion of the IPP strategy to both probe sets, which is not possible in conventional combinatorial SBH.

6.10 Process Steps

rSBH whole-sample analysis has the following processing steps that can be integrated into a single microfluidics chip (FIG. 9):

1) A simple sample treatment or DNA isolation (if necessary), including an effective way to collect pathogen DNA on a pathogen cocktail column;

2) Random DNA fragmentation to produce targets of proper length;

3) Direct end-attachment of DNA to the active substrate surface, for example by ligation to universal anchors;

4) Array washing to remove all unbound DNA and other molecules present in the sample;

5) Introduction of the first IPP pair from two IPP sets at proper probe concentration and T4 ligase or some other (e.g. thermostable) DNA ligase;

6) Incubation for less than 1 min with simultaneous illumination and signal monitoring at 1-10 frames per second;

7) Wash to remove the first IPP pair, followed by introduction of the second IPP pair; and

8) After all IPP pairs are tested, a computer program will generate a signature or sequence for each fragment and then compare them with a comprehensive database of signatures or sequences and report the nature of the DNA present in the sample.

6.11 Device Size and Characteristics

The device used with the method of the invention is based on that described in co-owned, co-pending U.S. patent application Ser. No. 10/738,108, herein incorporated by reference in its entirety. The apparatus of the present invention consists of three major components: 1) the handling sub-system for handling (mixing, introducing, removing) IPPs (it is contemplated that this module can be expanded to incorporate “on the chip” sample preparation); 2) the reaction chamber, a flow-through chamber with temperature control that harbors any substrate; and 3) the illumination/detection sub-system (FIGS. 10A, 10B, and 10C). These sub-systems work together to provide single fluorophore detection sensitivity.

The apparatus of the present invention operates a plug-in reaction chamber with a slot for the array substrate and ports for connecting the probe module and, potentially, an array preparation module if DNA attachment and/or in situ amplification is done within the chamber.

The cartridge comprises up to 64 individual reservoirs for up to 32 FRET donor pools and up to 32 FRET acceptor pools (FIG. 11C). The cartridge comprises a mixing chamber connected to each of the pool reservoirs by means of a single microfluidic channel and an integral vacuum/pressure actuated micro-valve.

6.11.1 The Reaction Chamber

The substrate, once attached to the reaction chamber, forms the bottom section of a hybridization chamber. This chamber controls the hybridization temperature and provides ports for the addition of probe pools to the chamber, removal of the probe pools from the evanescent field, redistribution of the probe pools throughout the chamber, and substrate washing. A labeled probe pool solution is introduced into the chamber and is given time to hybridize with the target DNA (a few seconds). Probes not involved in a hybridization event are pulled out of the evanescent field by creating a voltage potential in the hybridization solution. A high sensitivity CCD camera capable of single photon detection is used to detect FRET hybridization/ligation events (Ha, Methods 25:78-86 (2001), herein incorporated by reference in its entirety), by monitoring the substrate through a window at the top of the reaction chamber. Images of the substrate are taken at regular intervals for about 30 seconds. The chamber is then flushed to remove all probes and the next probe pool is introduced. This process is repeated 256-512 times until all probe pools have been assayed.

6.11.2 The Illumination Sub-System

The illumination sub-system is based on the TIRM background reduction model. TIRM creates a 100-500 nm thick evanescent field at the interface of two optically different materials (Tokunaga et al., Biochem. Biophys. Res. Commun. 235:47-53 (1997), herein incorporated by reference in its entirety). The apparatus of the present invention uses an illumination method that eliminates any effect that the Gaussian distribution of the beam would have on the assay. The laser and all other components in this sub-system of the device of the present invention are mounted to an optical table. A 1 cm scan line is created by moving the mirrors mounted on galvanometers 1 and 2 (FIGS. 10B and 10C). The scan line is then directed into the substrate through prism 1 by galvanometer 3. Galvanometer 3 is adjusted so that the scan line intersects the glass/water boundary at its critical angle. The beam undergoes total internal reflection, creating an evanescent field on the substrate. The evanescent field is an extension of the beam energy that reaches beyond the glass/water interface by a few hundred nanometers (generally between 100-500 nm). The evanescent field of the invention can be used to excite fluorophores close to the glass/water boundary and virtually eliminates background from the excitation source.

6.11.3 The Detector Sub-System

The device of the present invention uses a high sensitivity CCD camera (such as DV887 with 512×512 pixels from Andor Technology (Hartford, Conn.)) capable of photon counting, which is suspended above the hybridization chamber. The camera monitors the substrate through the window of the reaction chamber. The lens on the camera provides enough magnification so that each pixel receives the light from 3 square microns of the substrate. In another embodiment, the camera can be water-cooled for low-noise applications.

The highly sensitive electron multiplying CCD (EMCCD) detector makes high-speed single fluorophore detection possible. Assuming a 1 Watt excitation laser at 532 nm (for Cy3/Cy5 FRET), the number of photons emitted from the laser every second can be calculated and the number of photons which will reach the detector every second can be estimated. Using the equation e=hc/λ, wherein λ represents wavelength, a photon with a 532 nm wavelength has an energy of 3.73e-19 Joules. Given the laser output is one Watt, or one Joule/second, it is expected that 2.68e18 photons per second are emitted from the laser. Expanding this amount of energy across the 1 cm² substrate area (1e14 nm²), it is expected that each square nm will receive about 1e-14 Joules of energy per second, or about 26,800 photons. Assuming a quantum yield of 0.5 for the fluorophore, an output of about 13,400 photons per second is expected. Using a high quality lens, about 25% of the total output should be collected, or a total of 3350 photons per second, which are captured by the CCD. Andor's DV887 CCD has a quantum efficiency of about 0.45 at 670-700 nm, where Cy5 emits. This yields approximately 1500 photons per second that each pixel registers. At 10 frames per second, each frame registers 150 counts. The dark current of the camera at −75° C. is about 0.001 electrons/pixel/sec, or on average 1 false positive count every 1000 pixels once a second. Even if 1 false positive count per pixel per second is assumed, i.e. 0.1 per pixel per frame, a 1500:1 signal to noise ratio is obtained. In combination with the TIRM illumination technique, the detector background is virtually zero.
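By way of illustration only, the photon budget described above can be retraced with a short calculation. The following Python sketch is not part of the device software; the function name `photon_budget` and its parameter names are assumptions introduced here, and the default values are simply the figures quoted in the preceding paragraph.

```python
# Physical constants in SI units.
PLANCK_H = 6.626e-34   # Planck's constant, J*s
LIGHT_C = 2.998e8      # speed of light, m/s


def photon_budget(laser_power_w=1.0, wavelength_nm=532.0, area_cm2=1.0,
                  quantum_yield=0.5, collection_fraction=0.25,
                  ccd_quantum_efficiency=0.45, frames_per_second=10.0,
                  false_counts_per_second=1.0):
    """Retrace the per-pixel signal estimate (hypothetical helper)."""
    photon_energy_j = PLANCK_H * LIGHT_C / (wavelength_nm * 1e-9)  # e = hc/lambda
    photons_per_second = laser_power_w / photon_energy_j
    area_nm2 = area_cm2 * 1e14                 # 1 cm^2 = 1e14 nm^2
    incident = photons_per_second / area_nm2   # photons per nm^2 per second
    emitted = incident * quantum_yield
    collected = emitted * collection_fraction
    detected = collected * ccd_quantum_efficiency
    counts_per_frame = detected / frames_per_second
    noise_per_frame = false_counts_per_second / frames_per_second
    return {
        "photon_energy_j": photon_energy_j,
        "photons_per_second": photons_per_second,
        "incident_per_nm2": incident,
        "emitted": emitted,
        "collected": collected,
        "detected_per_second": detected,
        "counts_per_frame": counts_per_frame,
        "signal_to_noise": counts_per_frame / noise_per_frame,
    }


if __name__ == "__main__":
    for name, value in photon_budget().items():
        print(f"{name}: {value:.4g}")
```

With the default values the sketch gives roughly 1500 detected photons per second and about 150 counts per frame, in line with the estimate above.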

6.11.4 Miniaturization of the Device

In another embodiment, the method of the present invention can be performed in a miniature device. A simple physical device, requiring only a few off-the-shelf components, can perform the entire process. The illumination and detection components form the core of the system. This core system consists of only a CCD camera, a laser or other light source, zero to three scanning galvanometers, quartz or equivalent supports for the substrate, and a reaction chamber. It is possible to place all of these components in a one cubic foot device. A miniature fluid-handling robot or micro-fluidics lab-on-a-chip device (FIG. 9) will perform the assay by accessing pairs of IPPs from two libraries of 8 to 64 IPPs and can occupy about 0.5 ft³. High-density multi-well plates or lab-on-a-chips with 64 reservoirs will allow for ultra-compact storage of the library. A single board computer or laptop can run the device and perform the analysis. Such a system is easily transportable and can fit into almost any vehicle for field surveying of the environment or for use by responding emergency crews or biohazard workers. It is also possible for the device to fit in a medical pack and run on battery power to perform rapid, accurate screening in the field under almost any circumstance.

The components of the system include: 1) miniature personal computer (1 ft×1 ft×6 in), 2) robotic or lab-on-a-chip fluid handling system (1 ft×1 ft×2 in), 3) laser (6 in cube), 4) scanning galvanometers with heat sink (3 in cube), 5) slide/hybridization chamber assembly (3 in×1 in×2 in), 6) CCD camera (4 in×4 in×7 in), and 7) fluid reservoirs (approximately 10-1000 ml capacity).

Another embodiment of the device of the invention integrates a modular micro-fluidics based substrate upon which all assays are conducted for pathogen detection (FIGS. 11A, 11B, and 11C). The consumable substrate is in the form of an integrated “reaction cartridge.” The substrate component of the cartridge must accept three different kinds of integrated disposable modules including: probe pool module, sample integration module, and reaction substrate module. All machine functions act on this cartridge to produce the assay result. This substrate requires integrated fluidics such as quick connects, which the reaction cartridge and related modules will provide.

Microfluidics is introduced to the substrate in order to handle informational probe pools on the detection surface of the substrate. A modular approach is used in which the initial probe-handling module is developed independent of the substrate and the final design can be added to the standard substrate cartridge using a “plug and play” approach. The cartridge contains up to 64 individual reservoirs for up to 32 FRET donor pools and up to 32 FRET acceptor pools (see FIG. 11C). A larger number of IPPs can be stored on one or a set of cartridges, for example 2×64, 2×128, 2×256, 2×512, or 2×1024 IPPs. The cartridge has a mixing chamber connected to the main channel by its own microfluidic channel and an integral vacuum/pressure actuated micro-valve. When the valve is opened, a vacuum is applied to move a pool into the mixing chamber. The valve is then closed, and the process is repeated to add the second pool. The mixing chamber is in line with the wash pump, which is used to agitate the pools and push them into the reaction chamber.

6.12 Software Components and Algorithms

Raw data represents about 3-30 intensity values at different time/temperature points for each pair of pools (i.e. IPPs) in each pixel. Each value is obtained by statistical processing of 10-100 CCD measurements (preferably 5-10 per second). Each fragment has 512 sets of 3-30 intensity values. An array with one million fragments comprises about 10 billion intensity values. Signal normalization can be performed on groups of hundreds of pixels. All data points for a given pair of IPPs will be discarded if the set does not meet expected behavior. Each pixel (most of which will have proper DNA) with no useful data (i.e. not enough positive or negative data points) will be discarded. The distribution of intensity values in other pixels will be determined and used to adjust base calling parameters.

All individual short fragments can be mapped using a score signature to a corresponding reference sequence and analyzed using comparative sequencing processes, or can be assembled using de novo SBH functions. Each approximately 250 base fragment is assembled from about one million possible 10-mers starting from the primer sequences. The assembly process proceeds through evaluation of combined 10-mer scores calculated from overlapping 10-mers for millions of local candidate sequence variants.

A group of fragments from one array location that has significant overlapping sequences with groups of fragments from other array locations represents a long continuous genomic fragment. These groups can also be recognized by alignment of short fragment sequences to a reference sequence, or as an island of DNA-containing pixels surrounded by empty pixels. Assigning short fragments to groups, especially in partially structured arrays, is an intriguing algorithmic problem.

Short fragments within a group have originated from a fragmented single DNA molecule and do not overlap. But short sequences do overlap between corresponding groups, which represent long, overlapping DNA fragments, and this allows assembly of long fragments by a process identical to sequence assembly of cosmid or BAC clones in the shotgun sequencing process. Because long genomic fragments in the rSBH process vary from 5-100 kb and represent 5-50 genome equivalents, the mapping information is provided at all relevant levels to guide accurate contig assembly. The process can tolerate omissions and errors in assignment of short fragments to long fragments and about 30-50% randomly missing fragments in individual groups.

The rSBH method of the invention provides detection of rare organisms or quantification of numbers of cells or gene expression for each microbe. When the dominant species has 1× genome coverage, a species that occurs at the 0.1% level is represented by about 10 genomic fragments. DNA normalization can further improve detection sensitivity to 1 cell in more than 10,000 cells. DNA quantification is achieved by counting the number of occurrences of DNA fragments representing one gene or one organism. The absence of the cloning step implies that rSBH should provide a more quantitative estimate of the incidence of each DNA sequence type than conventional sequencing. For quantification studies, direct fragmentation of sample to 250 bp fragments and formation of standard (non-structured) random arrays is sufficient. Partial normalization can be used to minimize but still keep occurrence differences, and standardization curves can be used to calculate original frequencies. An array of one million fragments is sufficient for quantification of hundreds of genospecies and their gene expression.

6.12.1 rSBH Software

The present invention provides software that supports rSBH whole-genome (complex DNA sample) sequencing. The software can scale up to analysis of the entire human genome (~3 Gbp) or mixtures of genomes up to 10 Gbp. Parallel computing on several CPUs is contemplated.

The rSBH instrument can generate a set of tiff images at the rate of up to 10/sec or faster. Each image represents a hybridization of the target to pairs of pooled labeled probes. Multiple images may be produced for each hybridization to provide signal averaging. The target is fragmented into multiple pieces approximately 100 to 500 bases long. The fragments are attached to the surface of a glass substrate in a random distribution. After hybridization and wash of the non-hybridized probes, the surface is imaged with a CCD camera. Ultimately, each pixel of the image may contain one fragment, although some pixels may be empty while others may have two or more fragments. The instrument can potentially image 1-10 million, or even more, fragments.

The total instrument run time is determined by the hybridization/wash/image cycle (~1 min.) multiplied by the number of pool sets used. With 1024 pool sets (producing 1024 images), the run will last about 17 hours; two colors reduce this by one half. The image analysis software will process the images in near real time and send the data to the base-calling analysis software.
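The run-time estimate above is a simple product, which may be sketched as follows. This Python fragment is illustrative only; the function name `run_time_hours` and its parameters are assumptions introduced here, and the halving for two colors assumes that each color images a different pool set during the same cycle.

```python
def run_time_hours(pool_sets, cycle_minutes=1.0, colors=1):
    """Estimate instrument run time in hours (hypothetical helper).

    Assumes each hybridization/wash/image cycle takes cycle_minutes and
    that `colors` pool sets are imaged simultaneously in each cycle.
    """
    cycles = pool_sets / colors
    return cycles * cycle_minutes / 60.0


if __name__ == "__main__":
    print(f"one color:  {run_time_hours(1024):.1f} h")
    print(f"two colors: {run_time_hours(1024, colors=2):.1f} h")
```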

A. Parallel Processing

The rSBH analysis is ideally suited to parallel processing. Because each “spot” hybridizes to a different fragment, the base-calling analysis can be run in parallel on each spot with no need for communication between the analyses. The only communication in the entire analysis is between the control module (GUI) and the analysis programs. Very minor steps need to be taken to avoid race conditions. In practice the number of CPU's limits the number of parallel processes. For one million fragments, a computer with 100 processors will split the job into 100 parallel base-calling programs, each of which analyzes 10,000 or more fragments in series.

A set of 200 fragments can be run on one processor; however, it can also be run on several CPU's. An optimized base-calling program can finish in ~100 milliseconds if there are no mutations or mutation tests (update function). This time includes data loading and normalizations. Reference lookup time can add ~100 milliseconds for the longest reference (see below). Reference lookup time scales with length and is negligible for the short lengths. Analyzing multiple mutations can extend the run time up to about one minute per multiple mutation site. If the average analysis time is one second per fragment, one million fragments can be analyzed in 10,000 seconds using 100 CPU's. Similarly, 200 fragments can be analyzed in 200 seconds using one CPU or 20 seconds using 10 CPU's. Optimizing the programs for speed requires a significant amount of RAM per CPU. As described below, the software is not limited by memory if each CPU has ~2 GB to 8 GB, depending on the number of CPU's and number of fragments. Currently it is possible to purchase 32 GB+ of RAM per system.
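The wall-clock figures above follow from dividing the fragments evenly among the CPU's. A minimal sketch of that arithmetic is given below; the function name `analysis_seconds` is an assumption introduced here, and the sketch assumes that every fragment takes the same average time and that there is no communication overhead.

```python
import math


def analysis_seconds(n_fragments, n_cpus, seconds_per_fragment=1.0):
    """Wall-clock time when fragments are split evenly across CPUs.

    Hypothetical helper: each CPU analyzes its share of fragments in
    series, so the slowest CPU handles ceil(n_fragments / n_cpus).
    """
    fragments_per_cpu = math.ceil(n_fragments / n_cpus)
    return fragments_per_cpu * seconds_per_fragment


if __name__ == "__main__":
    print(analysis_seconds(1_000_000, 100))  # one million fragments, 100 CPUs
    print(analysis_seconds(200, 1))          # 200 fragments, one CPU
    print(analysis_seconds(200, 10))         # 200 fragments, 10 CPUs
```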

B. Data Flow

The GUI and image analysis program run on one CPU, while the base calling analysis programs run on several (N) CPU's. On startup, the image analysis program is supplied with the number N and monitors the directory that the CCD camera writes tiff images into. For each tiff file, it derives a score for each fragment and groups the scores into N files, one for each analysis CPU. For example, if there are 200 fragments and 10 CPU's, the image analysis program writes the first 20 fragment scores into a file for the first base-calling analysis CPU, the second 20 fragment scores into a second file for the second base-calling analysis CPU, and so on. It is also contemplated that other communication modes can be used, for example sockets or MPI. Therefore, the file I/O can be localized to one module so that it can easily be swapped out later.
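The grouping of fragment scores into N per-CPU files can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the actual image analysis program: the function names, the `cpu_000`-style directory names, and the one-score-per-line text format are all hypothetical. The file writing is kept in one function so that, as contemplated above, it could later be replaced by sockets or MPI.

```python
from pathlib import Path


def partition_scores(scores, n_cpus):
    """Split per-fragment scores into n_cpus contiguous groups."""
    chunk = -(-len(scores) // n_cpus)  # ceiling division
    return [scores[i * chunk:(i + 1) * chunk] for i in range(n_cpus)]


def write_score_files(scores, n_cpus, out_root, image_name):
    """Write one score file per analysis CPU for a single tiff image.

    All file I/O is localized here (assumed layout: one directory per CPU).
    """
    out_root = Path(out_root)
    paths = []
    for cpu, chunk in enumerate(partition_scores(scores, n_cpus)):
        cpu_dir = out_root / f"cpu_{cpu:03d}"
        cpu_dir.mkdir(parents=True, exist_ok=True)
        path = cpu_dir / f"{image_name}.scores"
        path.write_text("\n".join(str(s) for s in chunk))
        paths.append(path)
    return paths


if __name__ == "__main__":
    groups = partition_scores(list(range(200)), 10)
    print([len(g) for g in groups])  # 200 fragments, 10 CPUs: 20 scores each
```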

Over time, a multitude of image analysis files is created for the continually growing number of tiff files. The invention provides for a separate image analysis directory for each base-calling analysis CPU. The base-calling analysis CPU's each monitor their respective image analysis directories and load the data as it becomes available. The amount of RAM/CPU necessary to store all the image data is [2 bytes×no. fragments×no. images÷N]. This is ~2 GB/CPU for 1 million fragments, 1024 images and 1 CPU, or about 200 MB/CPU for 10 CPU's.
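The RAM formula [2 bytes×no. fragments×no. images÷N] can be evaluated directly. The following Python sketch is illustrative only; the function name `image_data_bytes_per_cpu` is an assumption introduced here.

```python
def image_data_bytes_per_cpu(n_fragments, n_images, n_cpus, bytes_per_score=2):
    """RAM per CPU needed to hold all image scores (hypothetical helper).

    Implements 2 bytes x no. fragments x no. images / N.
    """
    return bytes_per_score * n_fragments * n_images / n_cpus


if __name__ == "__main__":
    for n in (1, 10):
        gb = image_data_bytes_per_cpu(1_000_000, 1024, n) / 1e9
        print(f"{n} CPU(s): {gb:.3f} GB per CPU")
```

For 1 million fragments and 1024 images this evaluates to about 2.05 GB for one CPU and about 0.2 GB per CPU for 10 CPU's.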

The other significant (in terms of RAM) data input to the base-calling analysis program is the reference (length L). For speed optimization, the reference is converted to a vector of 10-mer (and 11-mer, 12-mer) positions providing for a quick lookup for the top scoring probes for each fragment (see below). It is fastest to store the reference position data on every base-calling analysis CPU. The amount of memory required to store the reference position data is 2 bytes×L, or 2 bytes×4¹², whichever is greater. For the longest reference contemplated (L=10 Gbp), the maximum RAM is 2 bytes×10×10⁹=20 GB. The actual reference itself must also be stored, but this can be stored as 1 byte/base or even compressed to 0.25 bytes/base.
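One possible form of the reference position data is a table that maps each 10-mer to the positions at which it occurs in the reference. The Python sketch below is a simplified illustration under stated assumptions: it uses a dictionary rather than the packed vector described above, and the function name `build_position_index` is hypothetical.

```python
from collections import defaultdict


def build_position_index(reference, k=10):
    """Map every k-mer of the reference to its start positions.

    Simplified stand-in for the reference positions data structure:
    a dict of lists, where a production version would likely use a
    packed array indexed by the numeric value of the k-mer.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


if __name__ == "__main__":
    idx = build_position_index("ACGTACGTAC", k=4)
    print(idx["ACGT"])  # positions of the 4-mer ACGT in the toy reference
```

A lookup of a top scoring probe is then a single table access, which is what makes the binning step in the base-calling program fast.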

Analysis of each fragment generates a called sequence result. These are concatenated into a file that is written to the image analysis directory associated with each CPU. When base calling is complete, the GUI processes the called sequence files. It loads all files from the different CPU's and reorders the fragments by position to generate a final complete called sequence. Note that reordering is trivial, as each fragment was located previously during the reference lookup step. The GUI can also provide a visualization tool of the called sequence. In addition, the GUI can display an intensity graph of the final sequence. In this case the base-calling program must also output the intensity files (concatenated as the called sequence data).

The current base calling program outputs a Short Report file based on the reference and spot scores (from the HyChip™ for example). This may not be useful for rSBH since the spots for each fragment are distributed among many hybridization slides. Instead, a new “Short Report” can be generated for each hybridization that is more abstract than the HyChip Short Report. Specifically, the new report can list the number (N) of full matches on each slide and the median of the highest N scores. It can also give the median of any control spots such as markers or empties, if any exist. The advantage of the new report is that it can be viewed in real time for each image on a constantly updated GUI table. This will tell the user early on (and throughout the run) if the rSBH system is generating useful data, instead of waiting a day to see the final results. An advanced use of the new report allows user feedback to the rSBH instrument, for example pausing or stopping the run from the GUI or repeating a pool set if any one failed. The GUI can also display instrument parameters in real time during a run, such as hybridization and wash temperatures. Ultimately, the product can integrate the instrument into the command and control module of the user GUI.

C. Base Calling

Since the pooled probes are the same for each fragment, the rSBH base-calling program can read in the pooled probes only once for all fragments. The base-calling program requires a reference sequence input. For rSBH, the reference is derived from an analysis of the clustering of the top few hundred scores. A simple binning algorithm of the positions of the top scores is most efficient, since it requires a single pass through the binned positions to find the maximum bin counts. The window of maximum bin counts locates the position of the fragment in the reference. With 250 bp fragments and 1024 measurements, ¼ of the fragment scores are positive (i.e. full match hybridization score). Then, due to the complexity of the pooled probes, ¼ of the 10-mers represent positive scores. Furthermore, for a reference longer than 4¹⁰, the probes are repeated, so that ¼ of all 10-mers in the reference are positive. The same applies for 11-mers and 12-mers; ¼ of all reference probes are positive. For a processor able to bin one probe in 1 nsec, it would take [L÷4÷10⁹] seconds to find the reference for a fragment. For the extreme L=10,000,000,000, this is 2.5 seconds/fragment using one CPU. For 1 million fragments and 100 CPU's the total time to find the references is 25,000 seconds (about 7 hours).
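The lookup-time estimate above can be reproduced with the short Python sketch below. It is illustrative only; the function names are assumptions introduced here, and the sketch assumes, as in the text, that ¼ of the reference probes are positive and that one probe is binned in 1 nsec.

```python
def lookup_seconds_per_fragment(reference_length, positive_fraction=0.25,
                                seconds_per_probe=1e-9):
    """Time to bin the positive reference probes for one fragment."""
    return reference_length * positive_fraction * seconds_per_probe


def total_lookup_seconds(n_fragments, n_cpus, reference_length):
    """Total time when fragments are divided evenly among the CPUs."""
    per_fragment = lookup_seconds_per_fragment(reference_length)
    return per_fragment * n_fragments / n_cpus


if __name__ == "__main__":
    L = 10_000_000_000
    print(lookup_seconds_per_fragment(L))                  # seconds per fragment
    print(total_lookup_seconds(1_000_000, 100, L) / 3600)  # total, in hours
```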

An alternative to binning the top L scores is to perform a de novo type of sequence assembly on each fragment to reduce the number of probes to much less than the 250 used in the example above. This will speed the fragment lookup process if the de novo algorithm is fast (e.g. less than 1 msec). A fast de novo algorithm can involve finding a few sets of 10 or more of the top 250 scores that have overlapping probes and can reduce the required time by an order of magnitude or more.

D. Base Calling Algorithm

1. Read probe pool files.
2. Read reference (length RL) and store into Reference object.
   2a. Generate reference positions data structure.
3. Read intensity files (in real time as they are generated from image analyses).
   3a. Store values into Scores data structure.
4. Accumulate about top L scores for each fragment (of median length L).
5. Analysis loop for each fragment:
   5a. Create a list of positions in the reference for the top L scores.
   5b. Create a vector whose length is [RL÷(m×L)], to bin the top score positions into. This gives a bin length of m×L, where m should be ~1.5 to provide a margin on either side of the fragment.
   5c. Bin the positions for the top L scores into the binning vector.
   5d. Find the region of highest total bin count. This gives the fragment reference to within (m−1)×L base positions.
   5e. Perform base calling using fragment reference.
   5f. Concatenate the called sequence onto a file: called Sequence (include the position information).
6. End of analysis loop for each fragment.
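Steps 5a through 5d of the algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual base-calling program: the function names are hypothetical, the reference positions are held in a plain dictionary, and the "region of highest total bin count" is taken here to be the best pair of adjacent bins so that a fragment straddling a bin boundary is still covered.

```python
def kmer_positions(reference, k=10):
    """Map each k-mer in the reference to the positions where it occurs."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index


def locate_fragment(top_probes, index, reference_length, fragment_length, m=1.5):
    """Locate a fragment in the reference by binning top-probe positions.

    Returns (start, end, count): the reference window covered by the best
    pair of adjacent bins and the number of probe positions inside it.
    """
    bin_length = max(1, int(m * fragment_length))       # step 5b: bin length m x L
    bins = [0] * (reference_length // bin_length + 1)
    for probe in top_probes:                            # steps 5a and 5c
        for pos in index.get(probe, ()):
            bins[pos // bin_length] += 1
    best_bin, best_count = 0, -1
    for i in range(len(bins)):                          # step 5d: single pass
        neighbor = bins[i + 1] if i + 1 < len(bins) else 0
        count = bins[i] + neighbor
        if count > best_count:
            best_bin, best_count = i, count
    start = best_bin * bin_length
    end = min(reference_length, start + 2 * bin_length)
    return start, end, best_count


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(5000))
    fragment = reference[1030:1080]
    probes = [fragment[i:i + 10] for i in range(len(fragment) - 9)]
    index = kmer_positions(reference)
    print(locate_fragment(probes, index, len(reference), len(fragment)))
```

The window returned would then be passed to the base-calling step (5e) as the fragment reference.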

6.13 Additional Embodiments

The method of the present invention allows for multiple mechanisms by which probes and IPPs are designed. In one embodiment, probes and IPPs are designed by varying the number of probes per pool, more specifically, in the range of 4 to 4096 probes per pool. In a second embodiment, probes and IPPs are designed by varying the number of pools per set, more specifically in the range of 4 to 1024 pools per set. Probes may have 2 to 8 informative bases providing a total of 4-16 bases. In yet another embodiment, probes are prepared as pools with degenerate synthesis at some positions. A further embodiment comprises having two assemblies of two sets of IPPs wherein different probes are mixed within one pool.

A small set of 20 to a few hundred probes can provide a unique hybridization signature of individual nucleic acid fragments. Hybridization patterns are matched with sequences to identify pathogens or any other nucleic acid, for example for counting mRNA molecules. One embodiment of the method of the invention uses signatures to recognize identical molecules on different random arrays. This allows, after hybridizing the same set of probes on different arrays to produce signatures, hybridization of different subsets of test probes on different arrays prepared from the same sample, followed by combination of data per individual molecule.

Another embodiment of the method of the invention performs single molecule DNA analysis without combinatorial ligation, using only a single set of IPPs or individual probes. In this embodiment, FRET signals are detected by labeling the target with a donor fluorophore and the probes with an acceptor fluorophore, or labeling the target with an acceptor fluorophore and the probes with a donor fluorophore. Probes in the form of 5′-N_(x)-B₄₋₁₆-N_(y)-3′ may be synthesized individually or as pools containing degenerate (mixed) bases at particular positions. In another embodiment, probe/probe pool hybridization is combined with polymerase-based extension of the hybridized probe by incorporation of one or more labeled nucleotides, wherein the nucleotides are typically differentially labeled.

Another embodiment of the method of the present invention utilizes probe removal to achieve multiple tests of a target molecule with the same probe sequence. Probe molecules can be repeatedly moved away from and toward the support surface using an electric field, magnetic field, or solution flow. The cycles occur from every 1-10 seconds up to every 20-30 seconds. Fluorescent signals are recorded for each phase of the cycle or, alternatively, only after probe removal is initiated, or only after probe removal is completed. The removal is coupled with temperature cycling. In this embodiment, probe removal does not require FRET labeling and instead relies on direct fluorescence from one label. Alternatively, the FRET reaction occurs between a labeled probe and a dye molecule attached to a target molecule.

A further embodiment of the method of the invention involving repeated testing of a probe sequence utilizes repeated loading of the same probe species from the outside container into the reaction chamber. A quick removal of the previous probe load is first followed with a wash buffer that does not remove full-match hybrids (the product of ligation of two probes if ligation is used), but removes free probes. A second wash is used that melts all hybrids before a subsequent probe load is introduced.

In another embodiment, each probe species interaction with a target molecule is measured only once. This process relies on redundant representation of the same DNA segment at different places within the array and/or on the accuracy of a one-time ligation event.

In addition to preparing final fragments before loading a sample on the support to form an array, a two-level cutting procedure is used in another embodiment of the method of the invention. Sample DNA is first randomly cut to form longer fragments (approximately 2-200 kb or more). A mixture of these fragments is loaded on the support that may be patterned by hydrophobic material in the form of a grid comprising cells of approximately 10×10 μm² in size. Concentration of the sample is adjusted such that predominantly one or a few long fragments will be present in each cell. These fragments will be further randomly fragmented in situ to a final fragment length of approximately 20-2000 bases and attached to the support surface. The optimal cell size depends on the total length of the DNA introduced per cell, the preferred length of the final fragments, and the preferred density of the final fragments. This fragmentation method of the invention provides long-range mapping information because all short fragments in one cell belong to one or a few long fragments from long overlapping fragments. This inference simplifies the assembly of long DNA sequences and may provide whole chromosome haplotype structure.

In another embodiment of the present invention, selected target DNA is captured from the complex sample using, for example, a column containing an equalized number of DNA molecules for certain genes or organisms. For example, selected viral or bacterial genomes or parts of genomes can be represented on these columns in the form of attached single-stranded DNA (ssDNA). Sample DNA is melted if it is double-stranded DNA (dsDNA), and complementary strands are captured by hybridization to immobilized DNA. The excess of complementary DNA or any other unrelated DNA is washed out. The captured DNA is then removed by high temperature or chemical denaturation. This process can be used to remove human and other complex DNA for diagnostics of infectious agents. It also provides a method to reduce the concentration of over-represented agents in order to detect, on a smaller array, other agents present in a low copy number. The capture process can be performed in tubes, wells of multi-well plates or in microfluidics chips.

Selection of specific genes or other genomic fragments is achieved by cutting DNA with restriction enzymes with downstream cutting and ligation of matching adaptors (described in co-owned, co-pending U.S. patent application Ser. No. 10/608,293, herein incorporated by reference in its entirety). Fragments that are not captured by adaptors will be degraded or otherwise removed. Another embodiment uses oligonucleotides of 6-60 bases, or more preferably, 10-40 bases, or even more preferably, 15-30 bases designed to match a given sequence with one or more mismatches, allowing cutting of DNA using mismatch recognition along with cutting enzymes. Two oligonucleotides can be designed for cutting complementary strands with about a 1-20 base shift, creating a sticky end for ligation of an adaptor or ligation to a vector arm. Two pairs of such oligonucleotide cutting templates from a genomic fragment can be obtained and captured or end modified for capture with a specific adaptor(s). Cutting templates are synthesized, or alternatively, one or more libraries of short oligonucleotides are designed to provide a universal source of necessary cutting templates for any DNA. Libraries of 256 oligonucleotides represented by the following consensus sequences nnnbbbnn, nnbbbbnn, or cggnnnbbbbnn, nnbbbnn, nnbbbnnncac, wherein n represents a mixture of four bases or a universal base, b represents a specific base, bbbb represents one of 256 possible 4-mer sequences, and cgg and cac represent examples of specific sequences shared by all members in the library, can be used to create cutting templates. To create cutting templates, an assembly template of nnnnnnnnnnnnnnnnnn, or gccnnnnnnnnnnnnnnnnnnnnnnnnngtg, may be used to ligate two or three members selected from corresponding oligonucleotide libraries.

In addition to various chemical attachment approaches, DNA fragments prepared by random cutting or by specific cutting may be attached to the surface using adaptors attached to fragments, or anchors, adaptors, primers, or other specific binders attached to the surface, or both. One embodiment uses randomly attached anchors with sticky ends of approximately 1-10 bases in length and ligates ssDNA fragments or dsDNA fragments with matching sticky ends. Sticky ends may be provided by adaptors attached to DNA fragments. This approach provides the possibility to have sections of substrate with anchors having different sticky ends to identify the end sequence of the attached fragment. Another embodiment attaches to a support a primer that is complementary to an adaptor attached to a DNA fragment. After ssDNA hybridizes to primers, the polymerase is used to extend the primer. The produced dsDNA is melted to remove the strand that is not attached to the support, or is used for DNA amplification as described below. Yet another embodiment coats the surface with specific binders (for example, cyclic peptides) that recognize 3′ or 5′ ends of DNA fragments and bind them with high affinity.

Analysis of short fragments attached to adaptors on one or both sides may help in reading through palindromes and hairpins because when there is a cut within a palindrome/hairpin, a new adaptor sequence will be attached that is not complementary to the rest of the sequence. Adaptors allow every base of the target DNA to be read with all overlapping probes.

In yet another embodiment, detection accuracy and efficiency is increased by using random arrays of single molecules followed by in situ, localized amplification (Drmanac and Crkvenjakov, 1990, supra, herein incorporated by reference) to generate up to 10, up to 100, up to 1000, or up to 10,000 replica molecules attached within the same pixel area. In this case, there is no need for single molecule sensitivity because multiple scores of probes are not necessary, even though FRET and TIRM may still be used. The amplification process comprises the following steps: 1) using a support coated with one primer (about 1000-50,000 primer molecules/μm²), 2) using sample DNA fragments modified with a ligated adaptor and second primer in solution. There is a need to minimize mixing and diffusion, for example by using a capillary chamber (a cover slip with only 10-100 μm space from the support) or embedding the target and second primer in a gel. The population of molecules generated by amplification for a single target molecule will form a spot, or “amplicon”, that should be less than 10-100 μm in size. Amplification of hybridization or ligation events may also be used to increase the signal.

A preferred embodiment uses continuous isothermal amplification (i.e. different types of strand displacement) because there is no need to denature dsDNA using high temperature, which can cause large-scale diffusion or turbulence; the displaced strand has no complementary DNA to bind to other than the attached primer; and a high local concentration of DNA can be produced. Another embodiment using isothermal amplification is to design at least one adaptor (for one end of the target DNA) with a core sequence that has a low melting temperature (i.e. using a TATATAT . . . sequence having 3-13 TA repeats) and primers substantially matching this core sequence. At the optimal temperature for the polymerase capable of strand displacement used in this reaction, the dsDNA at the TATATA . . . site will locally melt, allowing hybridization of the primer and initiation of a new cycle of replication. The length (i.e. stability) of the core can be adjusted to accommodate temperatures between 30 and 80° C. In this Continuous Amplification Reaction (CAR), new strand synthesis can start as soon as the enzyme performing the previous synthesis moves from the priming site, which takes a few seconds. The process is used to produce high concentrations of ssDNA starting with dsDNA if only one primer is used. For amplification where one primer is attached to the surface, the low temperature melting adaptor should be for the non-attaching end and the corresponding primer will be free in solution. CAR does not require any other enzymes in addition to the polymerase. Adaptors are introduced by ligation with DNA fragments or tail extensions of target specific primers for two or more initial amplification cycles on source dsDNA that may require melting by high temperature.

The nucleic acid analysis processes described above, based on probe/probe pool hybridization alone or in combination with base extension or two probe ligation to random arrays of sample DNA fragments, are used for various applications including: sequencing of longer DNA (including bacterial artificial chromosomes (BACs) or entire viruses, entire bacterial or other complex genomes) or mixtures of DNA; diagnostic sequence analysis of selected genes; whole genome sequencing of newborn babies; agricultural biotech research for precise knowledge of the genetic makeup of new crops and animals; individual cell expression monitoring; cancer diagnostics; sequencing for DNA computing; monitoring the environment; food analysis; and discovery of new bacterial and viral organisms.

The method of the present invention generates sufficient signal from a single labeled probe while reducing the background below the threshold of detection. Special substrate material or coating (such as metallization) and advanced optics are used to reduce high system background that prevents parallel detection of millions of single molecules from a 1 cm² surface. Alternatively, background that is introduced with the sample or during the DNA attachment process is reduced by special treatment of the sample, including affinity columns, modified DNA attachment chemistry (e.g. ligation), or binding molecules (e.g. cyclic peptides) with exclusive DNA specificity. In some instances, reduction of background produced by non-ligated probe complexes in solution or assemblies on the substrate requires cyclic removal of non-hybridized/ligated probes by electric field pulsing, specially engineered ligase with optimized thermal stability and full match specificity, or a triple FRET system with a third dye (e.g. quantum dot) attached to the target molecule.

In another embodiment, the method of the invention requires concentration of DNA molecules on the support by an electric field in order to capture all fragments from a chromosome or genome on a random array surface. Chromosome fragmentation to allow correct assembly may require a compartmentalized substrate and in situ fragmentation of initial individual 100 kb to 1 Mb DNA fragments to obtain linked groups of shorter 1-10 kb fragments.

Obtaining fast hybridization/ligation to allow multiple interrogations of the target with one pair of probe pools in less than 60 seconds/cycle may require the use of optimized buffers and/or active probe manipulation, potentially using electromagnetic fields. Fluorescent dyes (or dendrimers) with excitation properties compatible with DNA stability and precise control of illumination (nanosecond laser pulsing) are used to increase the chemical and physical stability of the system (including arrayed target DNA molecules) to tolerate several hours of illumination.

Fast real time image processing and assembly of individual fragments from overlapped probes and of the entire genome from overlapped DNA fragments may require programmable logic arrays or multiprocessor systems for high speed computation.

The method of the present invention relies on specific molecular recognition of complementary DNA sequences by labeled probes and DNA ligase to generate visible fluorescent signals. By relying on naturally evolved sequence recognition and enzymatic proofreading processes, rSBH eliminates the significant technical challenges of physically distinguishing individual DNA bases that are only 0.3 nm in size and differ by only a few atoms from one another. The method of the present invention also has very simple sample preparation and handling involving random fragmentation of chromosomal or other DNA and formation of small (1-10 mm²), random single-molecule arrays containing approximately one DNA molecule per square micron. The method of the present invention simultaneously collects high speed data on millions of single molecule DNA fragments. Using ten fluorescent colors and a 10 mega pixel CCD camera, a single rSBH device can read 10⁵ bases per second. The read length of the present invention is adjustable, from about 20-20,000 bases per fragment, and totaling up to 100 billion bases per single experiment on one random array. By initial fragmentation of individual long fragments and attachment of corresponding groups of short fragments to isolated random subarrays, the effective read length of the rSBH process may be up to 1 Mb. Maximal sequencing accuracy is assured by obtaining 100 independent measurements per base for each single DNA molecule tested (i.e. 10 overlapping probe sequences, each tested on average by 10 consecutive ligation events to the same DNA molecule).

Combinatorial SBH using IPPs provides over 99.9% accurate sequence data on PCR amplified samples several thousand bases in length. This read length is many times longer than that obtained by currently used gel-based methods and provides whole gene sequencing in a single assay. The method of the present invention combines the advantages of parallelism, accuracy and simplicity of hybridization-based DNA analysis with the efficiency of miniaturization and low material costs of single molecule DNA analysis. Application of universal probe sets, combinatorial ligation and informative probe pools allows efficient and accurate analysis of any and all DNA molecules and detection of any sequence changes within them using a single small set of oligonucleotide probe pools. The method of the present invention uses an integrated system to apply well-known biochemistry and informatics on ultra-high density, random single-molecule arrays to achieve a dramatic 1,000 to 10,000 fold higher sequencing throughput than in current gel and SBH sequencing methods. The method of the present invention will allow sequencing of all nucleic acid molecules present in complex biological samples, including mixtures of bacterial, viral, human and environmental DNA without DNA amplification or manipulation of millions of clones. Minimized sample handling, low chemical consumption and a fully integrated process will decrease the cost per base by as much as 1,000 fold or more. The method of the present invention is capable of sequencing the entire human genome on a single array within one day.
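As a rough consistency check of the one-day figure, the following sketch converts the read rate of 10⁵ bases per second stated above into bases per day and hours per genome. The human genome size of about 3×10⁹ bases and the single-pass (1×) reading are assumptions made for illustration only and are not specified in the text.

```python
# Back-of-the-envelope check of the throughput figures quoted in the text.
BASES_PER_SECOND = 1e5          # read rate stated for a single rSBH device
SECONDS_PER_DAY = 24 * 60 * 60
HUMAN_GENOME_BASES = 3e9        # assumption: approximate human genome size


def bases_per_day(rate=BASES_PER_SECOND):
    """Number of bases read in 24 hours at a constant rate."""
    return rate * SECONDS_PER_DAY


def hours_to_read(n_bases, rate=BASES_PER_SECOND):
    """Hours needed to read n_bases once at a constant rate."""
    return n_bases / rate / 3600.0


if __name__ == "__main__":
    print(f"bases per day: {bases_per_day():.2e}")
    print(f"hours per genome at 1x: {hours_to_read(HUMAN_GENOME_BASES):.1f}")
```

Under these assumptions a single pass over the genome takes roughly a third of a day, which leaves headroom for overlapping fragments within the stated one-day window.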

Random arrays of short DNA fragments are easily prepared at densities 100 fold higher than most standard DNA arrays currently in use. Probe hybridization to such arrays and advanced optics allows the use of mega-pixel CCD cameras for ultra-fast parallel data collection. Each pixel in the array monitors hybridization of a different DNA molecule, providing tens of millions of data points at a rate of 1-10 frames per second. Random arrays can contain over 100 billion base pairs on a single 3×3 mm surface with each DNA fragment represented in 10-100 pixel cells. The inherent redundancy provided by the SBH process (in which several independent overlapping probes read each base) helps assure the highest final sequence accuracy.
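The redundancy from overlapping probes can be illustrated with a short sketch that counts how many probes of length k cover each base of a fragment when every overlapping k-mer is interrogated. The function name and the example fragment are hypothetical; with k = 10, each interior base is covered by 10 probes, which matches the "10 overlapping probe sequences" mentioned earlier.

```python
def probes_covering_each_base(sequence, k):
    """Count how many overlapping k-mer probes cover each base of `sequence`."""
    counts = [0] * len(sequence)
    for start in range(len(sequence) - k + 1):
        for pos in range(start, start + k):
            counts[pos] += 1
    return counts


if __name__ == "__main__":
    fragment = "ACGT" * 10  # hypothetical 40-base fragment
    coverage = probes_covering_each_base(fragment, k=10)
    print(coverage)
```

Bases near the ends of a fragment are covered by fewer probes, which is one reason adaptor sequences at the ends are useful.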

To achieve the full capacity of the ligation method of the invention, which allows reading of up to 1000 bases per molecule, multiple IPP reagents must be handled simultaneously. The ligation method of the invention eliminates the need to covalently modify every target molecule analyzed. Because SBH probes are not covalently attached to targets, they can be easily removed or photo-bleached between cycles. In addition, the inclusion of a polymerase ensures that a base can be tested only once in any given DNA molecule. The hybridization/ligation process of the present invention allows multiple interrogations with each given probe and multiple interrogation of each base by several overlapping probes, providing a 100 fold increase in the number of measurements per base. In addition, ligase allows larger tag structures to be utilized (i.e. dendrimers with multiple fluorophores or Q-dots) than polymerase, which may further increase detection accuracy.

The method of the present invention can generate universal signature analysis of long DNA molecules using smaller incomplete sets of long universal probes. Single molecules up to 10 kb may be analyzed per pixel. An array of 10 million fragments, each 10,000 bp in length, contains one trillion (10¹²) DNA bases, the equivalent of 300 human genomes. Such an array is analyzed with a single 10 mega pixel CCD camera. Informative signatures are obtained in 10-100 minutes depending on the level of multiplex labeling. An analysis of a 10- or 100- or 1000-fold smaller array is very useful for signature or sequencing or quantification applications.
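The array capacity can be checked with simple arithmetic. The sketch below multiplies fragment count by fragment length and expresses the result in genome equivalents, assuming about 3×10⁹ bases per human genome (an assumption, not a figure from the text). Note that with the stated inputs (10 million fragments of 10,000 bp) the product is 10¹¹ bases, or roughly 30 genome equivalents, in line with the 100 billion bases per array cited elsewhere; the 10¹² and 300-genome figures would correspond to ten times as many fragments or ten-fold longer fragments.

```python
HUMAN_GENOME_BASES = 3e9  # assumption: approximate human genome size


def array_capacity(n_fragments, fragment_length_bp):
    """Total number of bases held on an array."""
    return n_fragments * fragment_length_bp


def genome_equivalents(total_bases, genome_size=HUMAN_GENOME_BASES):
    """Express a base count as a multiple of one genome."""
    return total_bases / genome_size


if __name__ == "__main__":
    total = array_capacity(10_000_000, 10_000)
    print(f"{total:.0e} bases, {genome_equivalents(total):.0f} genome equivalents")
```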

In one embodiment, a single pathogen cell or virus is represented with 10-10,000 fragments in the array, thereby eliminating the need for DNA amplification. The single molecule signature approach of the instant application provides a comprehensive survey of every region of the pathogen genome, representing a dramatic improvement over multiplex amplification of thousands of DNA amplicons analyzed on standard probe arrays. DNA amplification is a non-linear process and is unreliable at a single molecule level. Instead of amplifying a few segments per pathogen, the concentration of unwanted or contaminating DNA is reduced using pathogen affinity columns, and the entire genome of the collected pathogens can be analyzed. A single virus or bacterial cell can be collected from among thousands of human cells and is represented by 1 to 10 kb fragments on 10-1000 pixels, providing accurate identification and precise DNA categorization.

In another embodiment, the method of the present invention is used to detect and defend against biowarfare agents. rSBH identifies structural markers allowing immediate detection of bioagents at a single organism level before pathogenicity and symptoms develop. rSBH provides a comprehensive analysis of any or all of the genes involved in the pathogen's mode of attack, virulence, and antibiotic sensitivities in order to quickly understand the genes involved and how to circumvent any or all of these genes. rSBH can analyze complex biological samples containing mixed pathogens, host, and environmental DNA. In addition, the method of the invention is used to monitor the environment and/or personnel using rapid, low cost, comprehensive detection methods and can be made portable.

6.14 Kits

The present invention also provides for IPP kits to load on the cartridge or cartridges with preloaded probes as products, optionally including ligation mix with buffer and enzyme.

The present invention also provides for pathogen/gene-specific sample preparation kits and protocols for pathogens such as Bacillus anthracis and Yersinia pestis, from, for example, blood samples. The present invention provides for integration of sample preparation DNA products into the substrate resulting in the formation of the rSBH array of the invention. A stepwise process is described that yields an array of an individual target per pixel and an optional in situ amplification yielding 10-1000 copies per pixel. The result is a random array of target DNA that is subjected to rSBH for sequence analysis. The modular approach of the invention to the evolving substrate allows early versions of the substrate to have a simple sample application site, whereas final development may have a “plug and play” array preparation module.

DNA samples meeting the minimal purity and quantity specifications will serve as starting material for real sample integration with the rSBH sample arraying technology. Sample integration begins with enzymatic digestion (restriction enzyme or nuclease digest) of the products from the crude sample, creating specific (or random) sticky ends and providing fragments roughly 250 bp in length. This enzyme cocktail represents one of several components that would be provided in a product kit.

Arraying of the digest involves ligation of the sticky ends to complements arrayed onto the surface. The array surface is modified from its original glass surface as follows: 1) formation of an aminopropylsilane monolayer; 2) activation with a symmetric diisothiocyanate; and 3) modification of the activated array surface with a heterogeneous monolayer of probes, using a novel cocktail of amino-modified oligonucleotides (including capture probes, primer probes and spacer probes).

All of the attached probes share a conserved design (>90%), thus preventing the formation of homogeneous islands in which spacer and capture probes are segregated. The ratio of capture probe to all other probes gives rise to an average density equal to 1 complementary ligation site (for sample and capture probe) per square micron, and each square micron is observed by an individual pixel of an ultra-sensitive CCD. Next, by adding the digested DNA sample to the pre-formed array surface and ligating with T4 ligase to capture probes, the novel rSBH reaction site is achieved, consisting of one target per pixel. The excess sample is removed from the array surface, and via heating and additional washing the dsDNA gives rise to ssDNA. Here, a phosphorylation strategy is employed within the capture probe design to assure that only one strand is actually covalently ligated to the rSBH array and the other is removed by the wash.
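To illustrate what an average density of one ligation site per square micron implies at the level of individual pixels, the sketch below uses a Poisson model. The Poisson assumption (sites placed independently and uniformly at random, one pixel per square micron) is made here for illustration only; the text itself specifies only the average density.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def occupancy(mean_per_pixel):
    """Fractions of pixels with zero, exactly one, or more than one site."""
    p0 = poisson_pmf(0, mean_per_pixel)
    p1 = poisson_pmf(1, mean_per_pixel)
    return {"empty": p0, "single": p1, "multiple": 1.0 - p0 - p1}


if __name__ == "__main__":
    for name, fraction in occupancy(1.0).items():
        print(f"{name}: {fraction:.3f}")
```

Under this model a lower mean density increases the share of occupied pixels that hold a single site, at the cost of more empty pixels, so the capture-to-spacer probe ratio acts as the tuning parameter.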

Localized in situ amplification of targets may be necessary to create satisfactory signals (amplitude and accuracy) for detection, adapting well-known techniques (Andreadis and Chrisey, Nucl. Acids Res. 28:E5 (2000); Abath et al., Biotechniques 33:1210 (2002); Adessi et al., Nucl. Acids Res. 28:E87 (2000), all of which are herein incorporated by reference in their entirety). Isothermal strand displacement techniques may be the best suited for localized low copy number amplification. In order to space the capture probe, it is necessary to dope in spacer probes and primer probes. These probes share some conserved sequence and structure and each can function in the role described by their name. Hence, capture probes capture the target DNA, spacer probes help form the properly spaced monolayer of probes, and if necessary, primer probes are present for the in situ amplification. All targets work off the same arrayed primer sequence, simplifying the task. Once the sample is ligated to the array, the free termini of the arrayed DNA will get a universal primer for amplification. The in situ amplification is conducted on the molecules within the array using standard protocols and materials (i.e. primers, polymerase, buffer, NTPs, etc.). Only approximately 50 copies are needed, although 10-1000 would suffice. Each target can be amplified with different efficiency without affecting sequence analysis.

In summary, sample integration and rSBH array formation require DNA digestion of the product from crude sample preparation, isolation, and integration into the substrate to form the rSBH array. The present invention provides for reagents and kits related to each of the digestion, isolation and ligation steps.

7. EXAMPLES

7.1 Sequencing a Bacterial Genome

The entire bacterial genome of a common non-virulent lab strain is sequenced. An E. coli strain is chosen that has been well characterized and whose sequence is already known. The entire genome is sequenced in a single one-day assay. This assay demonstrates the full operation of the diagnostic system and defines the critical specifications related to projecting input and output of the system and universal requirements for crude sample isolation and preparation.

A single colony from a streak plate or a few milliliters of liquid culture provides ample material. Cells are lysed and DNA is isolated using protocols well known in the art (see Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, NY (1989) or Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, N.Y. (1989), both of which are herein incorporated by reference in their entirety). The yield is not critical; rather the quality of DNA is the important factor. Sample specifications defined in this example apply to all other samples. For final analysis, a genome copy number of 10-100 copies is used. The additional requirements for this assay are: 1) the DNA is free of DNA processing enzymes; 2) the sample is free of contaminating salts; 3) the entire genome is equally represented and constitutes a majority of the total DNA; 4) the DNA fragments are between 500 and 50,000 bp in length; and 5) the sample is provided as a sterile solution of DNA at a known concentration (for example, 1 μl at 1.0 μg/ml is sufficient).

The input copy number of 10-100 copies assures overlap of the entire genome and tolerates poor capture of targets on the array. With 10-100 copies, enough overlapping fragments are obtained to assure excellent success of base calling and high accuracy. The mass of the rSBH sample is approximately 1-10 pg, the majority of which is used to characterize and quantify the sample. Samples for analysis are obtained by serial dilution of the characterized product.

The DNA must be free of proteins, particularly nucleases, proteases, and other enzymes. Phenol-based extractions, such as PCI, are used to remove and inactivate most proteins (Sambrook et al., 1989, supra; Ausubel et al., 1989, supra). Hypotonic lysis or detergent-based lysis (with nuclease inhibitor cocktails such as EDTA and EGTA) followed by PCI extraction provides rapid and efficient sample digestion and DNA isolation in a single step. A phase lock extraction (available through 3′5′) simplifies this task and yields clean DNA. No digestion of the DNA is required at this time since shear forces during lysis and extraction give rise to fragments in the desired range. Removal of phenol is achieved through rigorous cleaning of the DNA (i.e. subsequent chloroform extraction, ethanol precipitation, and size exclusion). Phenol leaves an ultraviolet (UV) spectral signature which is used to test for purity and DNA quantification.

The DNA must be free of contaminating salts and organics and suspended in an SBH compatible Tris buffer. This is achieved by size exclusion chromatography or micro-dialysis.

The crude DNA sample ranges from 500 bp to 50,000 bp. Fragments below 500 bp are difficult to recover in isolation and purification and also affect the arraying process. Fragments larger than 50,000 bp are difficult to dissolve and can irreversibly aggregate.

The sample is provided as a sterile solution of at least 1 μl at 1 μg/ml. The total required amount of crude DNA is only ˜1 ng to 1 pg, which is less than 1% of the amount carried over to sequencing.
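The quantities involved can be related with a small unit-conversion sketch. It uses the example input of 1 μl at 1 μg/ml and the 1-10 pg rSBH sample mass given earlier; the function names are hypothetical and the calculation is only arithmetic, not a protocol.

```python
def dna_mass_ng(volume_ul, conc_ug_per_ml):
    """Mass in nanograms; 1 ug/ml is the same as 1 ng/ul."""
    return volume_ul * conc_ug_per_ml


def fraction_of_total(mass_pg, total_mass_ng):
    """Fraction of the total mass represented by a mass given in picograms."""
    return mass_pg / (total_mass_ng * 1000.0)


if __name__ == "__main__":
    total_ng = dna_mass_ng(volume_ul=1.0, conc_ug_per_ml=1.0)
    print(f"total crude DNA: {total_ng} ng")
    for pg in (1.0, 10.0):
        print(f"{pg} pg is {fraction_of_total(pg, total_ng):.1%} of the total")
```

With these inputs, 1-10 pg corresponds to 0.1-1% of a 1 ng sample.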

For the final sample preparation, the DNA is digested to yield fragments of an expected average length of approximately 250 bp harboring sticky ends, which are used to array the molecules on the combinatorial array surface. The molecules are spaced such that one molecule is found per square micron, which is observed by a single pixel of a CCD camera and represents a virtual reaction well within an array of millions of wells. This requires elimination of self-assembled monolayer (SAM) effects. An enzyme-driven protocol is used which ligates samples to specific sites that are spaced within a combinatorial array monolayer that is chemically attached to the surface of the detection substrate. The capture array is driven via SAM chemistry, but the small variance in the terminal complementary overhangs should not give rise to islands of like sequence. Thus, the substrate is prepared with the capture array and samples are attached to the substrate surface by enzymatic ligation of appropriate overhangs.

Alternatively, it may be necessary to amplify each target in situ, resulting in an “amplicon.” The amplification is achieved using a universal primer adaptor that is ligated to the target sequence by the termini that did not get attached in the initial capture ligation. DNA polymerase and NTPs are used to synthesize a new strand and displace the original complement, providing a displaced strand which has complementary elements in the capture array and thus in turn is captured and ligated. It is expected to generate ˜10 copies through linear amplification. Alternatively, exponential amplification strategies can be used to yield 100-1000 copies per micron.

The arrayed sample, either single molecule or localized amplicon, is subjected to rSBH cycle sequencing using dedicated probes and integrated microfluidics. Bioinformatics is fully integrated for data collection, storage, analysis and sequence alignment. The result is reported as the genomic sequence of the candidate organism with statistical analysis of base calling and accuracy.

7.2 Sample Preparation from B. anthracis and Y. pestis Cell Cultures or Blood Samples

7.2.1 Whole Genome Analysis

Isolation of a specific pathogen from a crude sample requires isolation or enrichment of the cells from the crude sample followed by lysis to yield the specific genome. Standard biochemistry and cell biology lab techniques, such as fractional centrifugation, filtration, culture, or affinity chromatography, are used to isolate the cells and then extract the genome. Typically, most pathogens are at least two orders of magnitude smaller than human cells and orders of magnitude larger than most bio-molecular structures, thus allowing reasonably facile isolation by traditional physical techniques. It is preferable to employ commercially available antibodies or other affinity tools already available for certain targets, such as viral coat proteins, to streamline isolation and minimize risk. Upon enrichment of the organisms, they are lysed using standard procedures and the DNA is isolated.

Alternatively, genomic amplification can be done using specific primers (for heterogeneous crude samples) harboring reversible affinity tags or a universal set of primers (for isolated cell types). Samples are subjected to lysis and, if necessary, crude DNA isolation. Primers are added to the crude sample along with the amplification cocktail and the product is isolated through reversible tagging and affinity capture.

7.2.2 Genomic Footprint Analysis

This method involves amplification of a specific set of footprinting genes specific to the organism of interest. By simultaneously examining multiple genetic regions, different strains of the same pathogen can be distinguished or large numbers of distinct pathogens can be screened. Assays that can be used to detect a variety of biothreat pathogens are described in Radnedge et al., Appl. Env. Micro. 67:3759-3762 (2001); Wilson et al., Molecular and Cellular Probes 16:119-127 (2002); Radnedge et al., Microbiology 148:1687-1698 (2002); and Radnedge et al., Appl. Env. Micro. in press (2003), all of which are herein incorporated by reference in their entirety. Regions of DNA are identified that are specific to the pathogen of interest, but not present in close relatives of the pathogen. Primers are then designed to check for amplification of a DNA product in environmental samples. B. anthracis and Y. pestis are used as model organisms. Defined quantities of pathogen cells are mixed with human blood to determine the sensitivity of detection. An early-stage symptomatic patient will have >10⁴ cells/ml blood for either of these pathogens. The goal is to detect the pathogen before it gets to the symptomatic stage. Blood samples are examined that possess from 10¹ to 10⁵ cells/ml to determine the accuracy of detection. Genomic DNA is extracted using the QiaAmp Tissue Kit 250 (Qiagen, Inc., Valencia, Calif.) or the NucleoSpin Multi-8 blood kit (Macherey-Nagel Inc., Duren, Germany). Pathogen concentration is determined by plating mid-log cells and by microscopic counting with a haemocytometer. Then 10 μl of diluted cells are added to 190 μl of human blood to approximate pre-symptomatic concentrations. Genomic DNA is then extracted and prepared for amplification of diagnostic targets and genes.

7.3 Assay for Preparation of 100 Diagnostic Targets from Biohazard-Free B. anthracis and Y. pestis DNA Samples

Targets are selected to identify regions of potential antibiotic resistance, mutations in virulence genes and vector sequences suggestive of genetic engineering. Such targets, especially virulence and antibiotic resistance genes, are generally not unique to a specific pathogen but provide additional qualitative information. Targeted DNA will be amplified with 50 primer pairs to interrogate relevant unique and qualitative regions of each pathogen. The products are pooled into one sample for SBH analysis. Multiplex primer pairs can be used to simplify the amplification of target sequences.

Primers are used that have a cleavable tag for isolation of the amplicon from the original complex DNA mixture. Preferably, the tag is biotin/streptavidin-based with a DTT cleavable disulfide bridge or a specifically engineered restriction site within the primer. Amplicons are isolated by the affinity tag and released as a purified DNA sample. Products are further purified by size exclusion to remove any unwanted salts and organics and then quantified for downstream integration.

7.4 Sequencing Samples from Microbial Biofilms

rSBH in combination with field studies and FISH is used to examine the biofilm community genome. Using rSBH, a biofilm community is sequenced at more than one time point and from distinct habitats to determine the number of genospecies. The analysis is facilitated by DNA normalization between samples to highlight differences in the genospecies level of community structure and to provide significant coverage of the genomes of low abundance genospecies. 16S rDNA clone libraries are constructed for each sample according to well-established protocols. FISH probes to distinguish phylotypes and targeted to SNPs to distinguish subtle variants within phylotypes are used to map out patterns of distribution and allow correlation between SBH-defined genospecies and 16S rDNA phylotype distribution. Samples are collected from physically and chemically distinct habitats and key environmental parameters are measured at the time of sample collection, including pH, temperature, ionic strength, redox state (i.e. the Fe²⁺/Fe³⁺ ratio), and concentration of dissolved organic carbon, copper, zinc, cadmium, arsenic and other ions.

7.5 Base Calling Simulation Test

Simulated data was generated for E. coli with 250 bp (average length) fragments overlapped by 90%. The first 10,000 fragments were analyzed using standard single base change calling. This amount was more than sufficient to check for accuracy and timing. The reference lookup successfully found the 10,000 fragment positions in the full 4 Mbp reference genome. Additionally, base calling was correctly performed on each of the fragments. Each fragment was binned against the full 4 Mbp reference, which validated the lookup timing and accuracy, independent of the number of fragments tested. The time required for reference lookup and base calling was 0.8 seconds/fragment. The base calling included testing for single base changes and normalizations used to optimize the accuracy. A margin on either side of the fragment was allowed in the reference lookup, which increased the resolving time.
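The general shape of such a simulation can be sketched as follows. This is a simplified illustration and not the software used for the test above: it treats each fragment as a known base sequence rather than a set of probe hybridization scores, uses a random reference instead of the E. coli genome, and locates fragments with a k-mer voting index. The seed length K, the function names and the random reference are all assumptions.

```python
import random
from collections import Counter, defaultdict

K = 12  # assumption: seed length used for the lookup index


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def make_fragments(reference, length=250, overlap=0.9):
    """Cut fixed-length fragments whose neighbours overlap by `overlap`."""
    step = max(1, round(length * (1.0 - overlap)))
    return [(s, reference[s:s + length])
            for s in range(0, len(reference) - length + 1, step)]


def locate(fragment, index, k=K):
    """Vote for the fragment's start position using all of its k-mers."""
    votes = Counter()
    for offset in range(len(fragment) - k + 1):
        for pos in index.get(fragment[offset:offset + k], ()):
            votes[pos - offset] += 1
    return votes.most_common(1)[0][0] if votes else None


def call_single_base_changes(fragment, reference, start):
    """List (reference position, reference base, observed base) differences."""
    window = reference[start:start + len(fragment)]
    return [(start + i, r, f)
            for i, (r, f) in enumerate(zip(window, fragment)) if r != f]


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(20000))
    index = build_index(reference)
    start, frag = make_fragments(reference)[40]
    # Introduce one substitution at offset 100 of the fragment.
    new_base = "A" if frag[100] != "A" else "C"
    frag = frag[:100] + new_base + frag[101:]
    found = locate(frag, index)
    print(found == start, call_single_base_changes(frag, reference, found))
```

Because lookup cost depends on the fragment length and the index rather than on how many fragments are processed, the per-fragment time stays roughly constant, which is consistent with the observation that timing was independent of the number of fragments tested.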

7.6 Arraying and Imaging Individual Q-Dots

Two microliters of 0, 8, 160 and 400 pM streptavidin-conjugated Qdots (Qdot Inc, Hayward, Calif.) were deposited on the surface (in the center of the coverslip) of biotin-modified coverslips (Xenopore Inc., Hawthorne, N.J.) for 2 min. The droplet was removed via vacuum. 10 μl of DI water was applied and removed in the same manner. This wash was repeated 4 times. The coverslip was placed treated side down on a clean glass slide. 1 μl of water was used to stick the slide to the surface. A small amount of objective immersion oil was placed at the edge of the coverslip to stop evaporation by creating a seal around the coverslip.

The substrate was imaged using a Zeiss Axiovert 200 with epi-illumination through a Plan Fluar 100× oil immersion (1.45 NA) objective. A standard Chroma Cy3 filter set was used to image the 655 nm emission from the Qdots. The transmission spectrum of the Chroma Cy3 emission filter overlaps with the emission spectrum of the 655 nm Qdots. Images were recorded using a Roper Scientific CoolSNAP HQ™ camera (Roper Scientific, Inc., Tucson, Ariz.) using a 50 ms exposure time. From the images, it was apparent that higher Qdot concentrations produced more visible spots. Control coverslips spotted with water had only a few visible spots due to various contaminations. In addition to seeing groups of Qdots with steady fluorescence of expected color, individual blinking spots of varying intensity and color were also seen. These features indicated that these small spots were single Qdots. The intensity differences may be explained by far-off wavelength, by particles lying out of the focus plane, or by variation in activity between individual particles. The significance of these results is that individual molecules, if labeled with Qdots, can be detected with advanced microscopy. Further reduction of background using the TIRF system and more efficient excitation by laser is expected to allow routine accurate detection of single fluorescent molecules.

7.7 Ligation Signals and Spotted Targets and Oligonucleotides

These experiments were designed to demonstrate: 1) spotted target can be used as a template for ligation of two probes with good full-match specificity; 2) spotted oligonucleotides can be used as primers (or as capture probes) for attaching target DNA to the surface.

A. Slide Setup

Four 5′-NH₂-modified oligonucleotides (see Table 1 for sequences) that can serve as targets or primers or capture probes were spotted at 7 different concentrations (1, 5, 10, 25, 50, 75, 90 pmole/μl) on a 1,4-phenylene diisothiocyanate derivatized slide, and each concentration was repeated 6 times. The long Tgt2-Tgt1-rc oligonucleotide contains the entire Tgt2 sequence and a portion of a sequence complementary to Tgt1 (underlined portions are complementary in the anti-parallel orientation). Tgt2-Tgt1-rc was used as a test target that can be captured by Tgt1, and the efficacy of capture can be tested by comparing 2-probe ligation to the Tgt2 sequence directly spotted and captured by Tgt1.

TABLE 1
Oligonucleotides used as targets or primers or capture probes

Primer name     Sequence                                                      SEQ ID NO:
Tgt1            NH-C6-C18-C18-CCGATCTTAGCAACGCATACAAACGTCAGT-3′ (30 mer)     1
Tgt2            NH-C6-C18-C18-TTCGACACGTCCAGGAACGTGCTTCAATGA-3′ (30 mer)     2
Tgt3            NH-C6-C18-C18-GTCAACTGTACCTATTCAGTCACTACTCAT-3′ (30 mer)     3
Tgt4            NH-C6-C18-C18-CAGCAGTACGATTCATACTTGCATAT-3′ (26 mer)         4
Tgt2-Tgt1-rc    TTCGACACGTCCAGGAACGTGCTTCAATGAACTGACGTTTGTATGCGTTG-3′        5
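The composition of the long Tgt2-Tgt1-rc oligonucleotide can be checked directly from the sequences in Table 1. The sketch below confirms that it begins with the full Tgt2 sequence and that its remaining bases are the reverse complement of the 3′ end of Tgt1, which is what allows spotted Tgt1 to capture it. The linker portion (NH-C6-C18-C18) is omitted, and the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequences from Table 1 (linkers omitted).
TGT1 = "CCGATCTTAGCAACGCATACAAACGTCAGT"
TGT2 = "TTCGACACGTCCAGGAACGTGCTTCAATGA"
TGT2_TGT1_RC = "TTCGACACGTCCAGGAACGTGCTTCAATGAACTGACGTTTGTATGCGTTG"


def capture_tail(long_target, leading_target):
    """Return the part of long_target that follows leading_target."""
    if not long_target.startswith(leading_target):
        raise ValueError("long target does not start with the leading target")
    return long_target[len(leading_target):]


if __name__ == "__main__":
    tail = capture_tail(TGT2_TGT1_RC, TGT2)
    site = reverse_complement(tail)
    print(len(tail), site, TGT1.endswith(site))
```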

B. Experiment 1

Hybridization/ligation was carried out in a closed chamber at room temperature for 1 hour. The reaction solution contained 50 mM Tris, 0.025 units/μl T4 ligase (Epicentre, Madison, Wis.), 0.1 mg/ml BSA, 10 mM MgCl₂, and 1 mM ATP at pH 7.8, with varying amounts of ligation probe pools (see Table 2) from 0.005 to 0.5 pmole/μl. After the reaction, the slide was washed in 3×SSPE for 30 minutes at 45° C., then rinsed with ddH₂O 3 times and spun dry. The slides were then scanned on an Axon GenePix 4000A with the PMT setting at 600 mV.

TABLE 2 Ligation probe pools

Pool 1           FM-pool                         SMM1-pool                       SMM2-pool
Tgt1-5′-probe    5′-NNNTGTATG (SEQ ID NO: 6)     5′-NNNTGTAAG (SEQ ID NO: 7)     5′-NNNTGTATG (SEQ ID NO: 6)
Tgt1-3′-probe    5′-CGTTGNN-* (SEQ ID NO: 8)     5′-CGTTGNN-* (SEQ ID NO: 8)     5′-CGATGNN-* (SEQ ID NO: 9)
Tgt2-5′-probe    5′-NNNCACGTT (SEQ ID NO: 10)    5′-NNNCACGAT (SEQ ID NO: 11)    5′-NNNCACGTT (SEQ ID NO: 10)
Tgt2-3′-probe    5′-CCTGGNN-* (SEQ ID NO: 12)    5′-CCTGGNN-* (SEQ ID NO: 12)    5′-CCAGGNN-* (SEQ ID NO: 13)
Tgt3-5′-probe    5′-NNNGACTGA (SEQ ID NO: 14)    5′-NNNGACTCA (SEQ ID NO: 15)    5′-NNNGACTGA (SEQ ID NO: 14)
Tgt3-3′-probe    5′-ATAGGNN-* (SEQ ID NO: 16)    5′-ATAGGNN-* (SEQ ID NO: 16)    5′-ATCGGNN-* (SEQ ID NO: 17)
Tgt4-5′-probe    5′-NNNGTATGA (SEQ ID NO: 18)    5′-NNNGTATCA (SEQ ID NO: 19)    5′-NNNGTATGA (SEQ ID NO: 18)
Tgt4-3′-probe    5′-ATCGTNN-* (SEQ ID NO: 20)    5′-ATCGTNN-* (SEQ ID NO: 20)    5′-ATGGTNN-* (SEQ ID NO: 21)

Note: * indicates TAMRA labeled; the underlined base indicates the position of the single mismatch.
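The relationship between the probe pools and the spotted targets can be illustrated with a short consistency check. This is a sketch under stated assumptions rather than a description of the original analysis: only the informative bases of each probe are used (the degenerate N positions are dropped), the ligated site is taken as the 5′-probe bases followed by the 3′-probe bases, and the names `PROBES`, `hamming`, and `check_probe_pools` are hypothetical.

```python
# Target sequences from Table 1 and informative probe bases from Table 2.
TARGETS = {
    "Tgt1": "CCGATCTTAGCAACGCATACAAACGTCAGT",
    "Tgt2": "TTCGACACGTCCAGGAACGTGCTTCAATGA",
    "Tgt3": "GTCAACTGTACCTATTCAGTCACTACTCAT",
    "Tgt4": "CAGCAGTACGATTCATACTTGCATAT",
}

# pool -> (5'-probe informative bases, 3'-probe informative bases)
PROBES = {
    "Tgt1": {"FM": ("TGTATG", "CGTTG"), "SMM1": ("TGTAAG", "CGTTG"), "SMM2": ("TGTATG", "CGATG")},
    "Tgt2": {"FM": ("CACGTT", "CCTGG"), "SMM1": ("CACGAT", "CCTGG"), "SMM2": ("CACGTT", "CCAGG")},
    "Tgt3": {"FM": ("GACTGA", "ATAGG"), "SMM1": ("GACTCA", "ATAGG"), "SMM2": ("GACTGA", "ATCGG")},
    "Tgt4": {"FM": ("GTATGA", "ATCGT"), "SMM1": ("GTATCA", "ATCGT"), "SMM2": ("GTATGA", "ATGGT")},
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def check_probe_pools() -> dict:
    """For each target, report whether the ligated full-match site is
    complementary to the target, and how many bases each single-mismatch
    pool differs from the full-match site."""
    report = {}
    for name, target in TARGETS.items():
        fm_site = "".join(PROBES[name]["FM"])
        report[name] = {
            "fm_complementary": fm_site in reverse_complement(target),
            "smm1_mismatches": hamming(fm_site, "".join(PROBES[name]["SMM1"])),
            "smm2_mismatches": hamming(fm_site, "".join(PROBES[name]["SMM2"])),
        }
    return report


report = check_probe_pools()
for name, row in report.items():
    print(name, row)
```

In this reading of Table 2, the SMM1 pools carry the mismatch in the 5′-probe and the SMM2 pools carry it in the 3′-probe, which matches the two FM/SMM columns reported in the results.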

C. Experiment 2

A slide spotted with four NH₂-modified 26- to 30-mers was hybridized with 1 pmole of the long target Tgt2-Tgt1-rc (Table 1) in 20 μl of 50 mM Tris, 0.1 mg/ml BSA, and 10 mM MgCl₂, pH 7.8, at room temperature for 2 hours. The slide was washed with 6×SSPE at 45° C. for 30 minutes, and then incubated with ligation probes (Tgt2-5′-probe and Tgt2-3′-probe, Table 2) at room temperature for 1 hour in the presence of 0.5 unit/20 μl of T4 ligase. After the reaction, the slide was washed and scanned as described above.
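The amounts quoted for Experiments 1 and 2 mix per-volume and per-reaction units, so a short conversion may help when comparing them. The sketch below is illustrative only; the helper names are assumptions, and the only fact used beyond the text is that 1 pmole/μl equals 1 μM (1000 nM).

```python
def pmol_per_ul_to_nM(conc_pmol_per_ul: float) -> float:
    """Convert pmole/ul to nM (1 pmole/ul = 1 uM = 1000 nM)."""
    return conc_pmol_per_ul * 1000.0


def per_volume(amount: float, volume_ul: float) -> float:
    """Convert a per-reaction amount to a per-microliter value."""
    return amount / volume_ul


# Experiment 1: ligation probe pools ranged from 0.005 to 0.5 pmole/ul.
probe_low_nM = pmol_per_ul_to_nM(0.005)
probe_high_nM = pmol_per_ul_to_nM(0.5)

# Experiment 2: 1 pmole of Tgt2-Tgt1-rc in 20 ul, and 0.5 unit of ligase in 20 ul.
target_nM = pmol_per_ul_to_nM(per_volume(1.0, 20.0))
ligase_units_per_ul = per_volume(0.5, 20.0)

print(f"Probe pool range: {probe_low_nM:.0f} to {probe_high_nM:.0f} nM")
print(f"Long target in Experiment 2: {target_nM:.0f} nM")
print(f"Ligase in Experiment 2: {ligase_units_per_ul} units/ul")
```

On this arithmetic, the ligase concentration in Experiment 2 is the same as the 0.025 units/μl used in Experiment 1.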

D. Results

1. Ligation signal depends on the concentration of spotted targets and the concentrations of the 5′ probe and 3′ probe in the reaction solution.

FIGS. 12A, 12B, 12C and 12D show the dependence of the ligation signal on the spotted targets Tgt1, Tgt2, Tgt3, and Tgt4, respectively, and on the ligation probes in solution. The highest signal was achieved when the spotted target concentration was approximately 75 pmole/μl and the ligation probes (probe-5′ and probe-3′) were present at approximately 1 pmole in 20 μl of reaction solution. These dependencies indicate that the observed signals were in fact ligation-dependent signals and that a spotted target can be used as a template for ligation. Discrimination between the full-match ligation probe and the single-mismatch probe was about 4- to 20-fold (Table 3).

TABLE 3 Full match and single mismatch discrimination of ligation signal

Target    FM/SMM of 5′-probe    FM/SMM of 3′-probe
Tgt1      14                    20
Tgt2      7                     12
Tgt3      9                     16
Tgt4      4                     4
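The "about 4- to 20-fold" range quoted above can be read straight off Table 3. As a small illustrative sketch (the variable names are assumptions, and the per-column means are a summary added here, not a figure reported in the original), the ratios can be tabulated and summarized as follows:

```python
from statistics import mean

# FM/SMM discrimination ratios copied from Table 3:
# target -> (ratio for mismatch in 5'-probe, ratio for mismatch in 3'-probe)
DISCRIMINATION = {
    "Tgt1": (14, 20),
    "Tgt2": (7, 12),
    "Tgt3": (9, 16),
    "Tgt4": (4, 4),
}

all_ratios = [r for pair in DISCRIMINATION.values() for r in pair]
lowest, highest = min(all_ratios), max(all_ratios)

# Per-column means, computed only as a convenience summary.
mean_5p = mean(pair[0] for pair in DISCRIMINATION.values())
mean_3p = mean(pair[1] for pair in DISCRIMINATION.values())

print(f"Discrimination range: {lowest}- to {highest}-fold")
print(f"Mean FM/SMM, 5'-probe mismatch: {mean_5p}")
print(f"Mean FM/SMM, 3'-probe mismatch: {mean_3p}")
```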

2. Spotted Oligonucleotides can be Used as a Primer (or Capture Probe) to Efficiently Attach Target DNA.

Oligonucleotide 1 (Tgt1) spotted on the slide served as a capture probe for target Tgt2-Tgt1-rc, which comprises a section of the reverse complement sequence of Tgt1 at its 3′-side and a Tgt2 sequence at its 5′-side. After hybridization/capture of Tgt2-Tgt1-rc, the ligation probes (Tgt2-5′-probe and Tgt2-3′-probe) were hybridized/ligated on the dots of the Tgt2 target as well as on the dots with the Tgt1 target. The observed ligation signals are shown in FIG. 13. Clearly, under these conditions, a spotted target can be used as a primer (or capture probe) to attach target DNA in a form available for hybridization/ligation of the short probes used for sequence determination.

1. An apparatus for determining sequence information for a target nucleic acid by probe hybridization, comprising: a sample integration module configured for mixing, introducing, and/or removing reagents; a disposable plug-in reaction cartridge configured for contacting an array of target nucleic acid fragments with probe pools, wherein the reaction cartridge comprises a slot for securing an array of single DNA molecules or amplicons, and quick connect ports for flow-through connection to the sample integration module; a subsystem configured for illuminating fluorophores on an array in the reaction cartridge; and a subsystem configured for detecting fluorophores on an array in the reaction cartridge.
2. The apparatus of claim 1, wherein the sample integration module is configured for arraying fragments of a target nucleic acid in a substrate.
3. The apparatus of claim 1, wherein the reaction cartridge comprises a mixing chamber connected to a plurality of probe pool reservoirs by means of a single microfluidic channel.
4. The apparatus of claim 1, wherein the illuminating subsystem is configured to create a 100 to 500 nm thick evanescent field at the interface of two optically different materials.
5. The apparatus of claim 1, wherein the detecting subsystem is a sensitive electron multiplying charge-coupled device (CCD) configured for detection of fluorophores on the array.
6. The apparatus of claim 1, further comprising probe modules that are configured for delivering fluorescently labeled probes to the apparatus.
7. A system for determining sequence information for a target nucleic acid comprising an apparatus according to claim 1, and a plurality of probe pools.
8. The system of claim 7, wherein the probes in the probe pools contain a label and a nucleotide sequence comprising the formula N_(x)B_(y)N_(z), the formula N_(x)B_(y) or the formula B_(y)N_(z), wherein: (i) each N is independently a degenerate base wherein N represents any of the four nucleotide bases and varies between probes in each of said probe pools; (ii) each B is independently an informative base, wherein B is the same base for probes in each of said probe pools; (iii) x and z are each at least one.
9. The system of claim 8, further comprising a computer programmed for parallel processing of data from the subsystem configured for detecting fluorophores on an array in the reaction cartridge.
10. The system of claim 8, further comprising an array of fragments of a target nucleic acid configured for hybridizing with the probe pools.